Depending on the domain, the reality can be the reverse.
Multiprocessing in the web serving domain, as in "spawning separate processes", is actually simpler and less bug-prone, because there is considerably less resource sharing. The considerably higher difficulty of writing, testing and debugging parallel code is evident to anybody who's worked on it.
As for the overhead, this again depends on the domain. It's hard to quantify, but generalizing to "massive" is not accurate, especially for app servers with COW support.
The default for multiprocessing on Linux is still to fork (fortunately changing in 3.14), which means all of your parent process’ threaded code (incl. third party libraries) has to be fork-safe. There are no static analysis checks for this.
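For anyone who wants to opt out of fork today, here is a minimal sketch using the standard library's multiprocessing contexts. The `square` worker and the `square_in_child` helper are hypothetical names made up for illustration.

```python
import multiprocessing as mp


def square(n, queue):
    # Runs in the child process; sends the result back through the queue.
    queue.put(n * n)


def square_in_child(n, method="spawn"):
    # A context pins the start method for the objects created from it,
    # without changing the process-wide default.
    ctx = mp.get_context(method)
    queue = ctx.Queue()
    proc = ctx.Process(target=square, args=(n, queue))
    proc.start()
    result = queue.get()
    proc.join()
    return result


# Start methods available on this platform ("fork" is missing on Windows).
available_methods = mp.get_all_start_methods()
```

With "spawn" the child is a fresh interpreter, so no threads or held locks are copied from the parent. The catch is that the child re-imports the main module, so a call like `square_in_child(7)` should sit under an `if __name__ == "__main__":` guard in a real script.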
This kind of easy-to-use but incredibly hard-to-use-safely library has made Python for long-running production services incredibly painful in my experience.
[1] Some arguments to subprocess.Popen look handy but actually cause Python interpreter code to be executed after the fork and before the execve, which has caused production logging-related deadlocks for me. The original author was very bright but didn’t notice the footgun.
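The comment does not say which arguments were involved. One case the subprocess documentation itself warns about is `preexec_fn`, which runs a Python callable in the child between fork and exec and is documented as unsafe when the parent has threads. A hedged sketch (POSIX only) of the documented alternative, assuming the goal was just to start the child in its own session:

```python
import subprocess
import sys

# Child script: report whether it is the leader of its own session.
CHILD_CODE = "import os; print(os.getsid(0) == os.getpid())"

# Risky form (not run here): preexec_fn=os.setsid executes Python code in the
# forked child before exec. Only the forking thread exists in the child, so
# locks held by other parent threads (logging, for example) are never released.
#
# Safer form: the same effect is requested through a plain flag, so no Python
# callable runs between fork and exec.
completed = subprocess.run(
    [sys.executable, "-c", CHILD_CODE],
    start_new_session=True,
    capture_output=True,
    text=True,
    check=True,
)
in_new_session = completed.stdout.strip() == "True"
```

Whether this matches the original deadlock is an assumption. The point is only that flags like `start_new_session` avoid running interpreter code in the fork-to-exec window.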
If I may: Changing from fork to what?
As an aside I still constantly see side effects in imports in a ton of libraries (up to and including resource allocations).
Compared to threads being "pain free"?
I feel like all of this is tragic and Python should have gone to a BEAM-like model some years ago, like as part of the 2 to 3 transition. Instead we get async wreckage and now free threading with its attendant hazards. Plus who knows how many C modules won't be expecting this.
Also, I've found that ChatGPT/Claude3.5 are much, much smarter and better at Python than they are at C++ or Rust. I can usually get code that works basically the first or second time with Python, but very rarely can do that using those more performant languages. That's increasingly a huge concern for me as I use these AI tools to speed up my own development efforts very dramatically. Computers are so fast already anyway that the ceiling for optimization of network oriented software that can be done in a mostly async way in Python is already pretty compelling, so then it just comes back again to developer productivity, at least for my purposes.
It will be interesting to see how this goes over the next few years. My guess is that a lot of lessons were learned from the python 2 to 3 move. This plan seems pretty solid.
And of course there's a relatively easy fix for code that can't work without a GIL: just do what people are doing today and just don't fork any threads in python. It's kind of pointless in any case with the GIL in place so not a lot of code actually depends on threads in python.
Preventing the forking of threads in the presence of things still requiring the GIL sounds like a good plan. This is a bit of metadata that you could build into packages. This plan is actually proposing keeping track of what packages work without a GIL. So, that should keep people safe enough if dependency tools are updated to make use of this metadata and actively stop people from adding thread-unsafe packages when threading is used.
So, I have good hopes that this is going to be a much smoother transition than python 2 to 3. The initial phase is probably going to flush out a lot of packages that need fixing. But once those fixes start coming in, it's probably going to be straightforward to move forward.
AMD EPYC 9754 with 128 cores/256 threads, and EPYC 9734 with 112 cores/224 threads. Tom's Hardware says they "will compete with Intel's 144-core Sierra Forest chips, which mark the debut of Intel's Efficiency cores (E-cores) in its Xeon data center lineup, and Ampere's 192-core AmpereOne processors".
What in 5 years? 10? 20? How long will "1 core should be enough for anyone using Python" stand?
There is code that may benefit from the free threaded implementation but it is not as often as it might appear and it is not without its own downsides. In general, GIL simplifies multithreaded code.
There have been no-GIL Python implementations such as Jython and IronPython. They haven't replaced the CPython and PyPy implementations, which use a GIL; i.e., other concerns dominate.
If you're looking for a 32x or 128x performance improvement from python supporting multi-core you should probably rewrite in C, C++, Rust, or Fortran and get that 100x improvement today on a single core. If done properly you can then ALSO get the gain from multiple cores on top of that. Or to put it another way, if performance is critical python is a poor choice.
A piece of code takes 6h to develop in C++, and 1h to run.
The same algorithm takes 3h to code in Python, but 6h to run.
If I could thread-spam that Python code on my 24 core machine, going Python would make sense. I've certainly been in such situations a few times.
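The arithmetic behind that trade-off, as a quick sketch. The hour figures come from the comment above; perfect scaling across all 24 cores is an optimistic assumption, not a measurement.

```python
# Figures from the comment; perfect scaling across cores is an assumption.
CPP_DEV_H, CPP_RUN_H = 6, 1
PY_DEV_H, PY_RUN_H = 3, 6
CORES = 24

cpp_total = CPP_DEV_H + CPP_RUN_H                # develop + run in C++
py_single_total = PY_DEV_H + PY_RUN_H            # develop + run on one core
py_parallel_total = PY_DEV_H + PY_RUN_H / CORES  # run time split over 24 cores

print(cpp_total, py_single_total, py_parallel_total)  # 7 9 3.25
```

So single-threaded Python loses to C++ (9h vs 7h), while ideally parallel Python wins (3.25h). Real scaling will be worse than ideal, but there is a lot of headroom before the C++ route catches up.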
Julia is one that is gaining a lot of use in academia, but any number of modern, garbage collected compiled high level languages could probably do.
Every DL library comes with its own C++ backend that does this for now, but it's annoyingly inflexible. And dealing with GIL is a nightmare if you're dealing with mixed Python code.
IDK what l should and shouldn't be written in, but there are a very large # of proud "pure Python" libraries on GitHub and HN.
The ecosystem seems to even prefer them.
Why shouldn't someone who prefers writing in python benefit from using multiple cores?
I did use the words "most things". I'm not saying this is a bad development for Python, or that nobody should use it. But if performance is a top priority, Python is the wrong language and always has been.
I use Python from time to time, it's fun and easy to put certain kinds of things together quickly. But each time I do a project with it, the first thing I ask myself is "is this going to be fast enough?" If not I'll use something else.
This just isn’t true.
This does not improve single threaded performance (it’s worse) and concurrent programming is already available.
This will make it less annoying to do concurrent processing.
It also makes everything slower (arguable where that ends up, currently significantly slower) overall.
This is way overhyped.
At the end of the day this will be a change that (most likely) makes the existing workloads for everyone slightly slower and makes the lives of a few people a bit easier when they implement natively parallel processing like ML.
It’s an incremental win for the ML community, and a meaningless/slight loss for everyone else.
At the cost of a great. Deal. Of. Effort.
If you’re excited about it because of the hype and don’t really understand it, probably calm down.
Most likely, at the end of the day, it’s a change that is totally meaningless to you, won’t really affect you other than making some libraries you use a bit faster, and others a bit slower.
Overall, your standard web application will run a bit slower as a result of it. You probably won’t notice.
Your data stack will run a bit faster. That’s nice.
That’s it.
Over hyped. 100%.
The rest of us can live with arcane threading bugs and yet another split ecosystem. As I understand it, if a single C-extension opts for the GIL, the GIL will be enabled.
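If you want to see what your own interpreter is actually doing, here is a small sketch. It assumes CPython 3.13 or newer for `sys._is_gil_enabled()` and falls back to "GIL enabled" on older versions; the two helper names are made up for illustration.

```python
import sys
import sysconfig


def build_supports_free_threading() -> bool:
    # Py_GIL_DISABLED is set to 1 for builds configured with --disable-gil.
    return bool(sysconfig.get_config_var("Py_GIL_DISABLED"))


def gil_is_enabled() -> bool:
    # sys._is_gil_enabled() was added in CPython 3.13; older interpreters
    # always run with the GIL, so default to True when it is missing.
    check = getattr(sys, "_is_gil_enabled", None)
    return True if check is None else bool(check())


print("free-threaded build:", build_supports_free_threading())
print("GIL currently enabled:", gil_is_enabled())
```

On a free-threaded 3.13 build, running the second check again after importing your C extensions shows whether one of them caused the GIL to be switched back on.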
Of course the invitation to experiment is meaningless. CPython is run by corporations, many excellent developers have left and people will not have any influence on the outcome.
Man, that phrase perfectly encapsulates so much of Python’s evolution over the last ~10 years.
If you assume two completely separate implementations where there is an #ifdef every 10 lines and atomics and locking only occur with --disable-gil, there is no slowdown for the --enable-gil build.
I don't think that is entirely the case though!
If the --disable-gil build becomes the default in the future, then peer pressure and packaging discipline will force everyone to use it. Then you have the OBVIOUS slowdown of atomics and of locking the reference counting and in other places.
The advertised figures were around 20%, which would be offset by minor speedups in other areas. But if you compare against Python 3.8, for instance, the slowdowns are still there (i.e., not offset by anything). Further down on the second page of this discussion numbers of 30-40% have been measured by the submitter of this blog post.
Actual benchmarks of Python tend to be suppressed or downvoted, so they are not on the first page. The Java HotSpot VM had a similar policy that forbade benchmarks.
^ read. The OP responds in the thread.
tldr, literally what I said:
> It also makes everything slower (arguable where that ends up, currently significantly slower) overall.
longer version:
If there was no reason for it to be slower, it would not be slower.
...but, implementing this stuff is hard.
Doing a zero cost implementation is really hard.
It is slower.
Where it ends up eventually is still a 'hm... we'll see'.
To be fair, they didn't lead the article here with:
> Right now there is a significant single-threaded performance cost. Somewhere from 30-50%.
They should have, because now people have a misguided idea of what this wip release is... and that's not ideal; because if you install it, you'll find it's slow as balls; and that's not really the message they were trying to put out with this release. This release was about being technically correct.
...but, it is slow as balls right now, and I'm not making that shit up. Try it yourself.
/shrug
But, for sure, nogil will be good for those workloads written in pure Python (though I've personally never been affected by that).
I use coroutines and multiprocessing all the time, and saturate every core and all the IO, as needed. I use numpy, pandas, xarray, pytorch, etc.
How did this terrible GIL overhead go completely unnoticed?
That means your code is using Python as glue and you do most of your work completely outside of CPython. That's why you don't see the impact - those libraries release the GIL when you use them, so there's much less overhead.
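A stdlib-only sketch of the same effect, with `hashlib` standing in for numpy/pytorch (that substitution is my assumption, not what the parent runs). `hashlib` is documented to release the GIL while hashing inputs larger than 2047 bytes, so plain threads can overlap this work even on a regular GIL build.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

# Four 1 MiB buffers, each large enough for hashlib to release the GIL.
buffers = [bytes([i]) * (1024 * 1024) for i in range(4)]


def digest(data: bytes) -> str:
    # The hashing itself happens in C with the GIL released.
    return hashlib.sha256(data).hexdigest()


sequential = [digest(b) for b in buffers]

with ThreadPoolExecutor(max_workers=4) as pool:
    # pool.map preserves input order, so the two lists are comparable.
    threaded = list(pool.map(digest, buffers))
```

The results are identical either way. Only the Python-level glue between calls is serialized by the GIL, which is why workloads like the parent's never notice it.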
I've never heard threading described as "simple", even less so as simpler than multiprocessing.
Threads mean synchronization issues, shared memory, locking, and other complexities.
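To make "locking" concrete, a minimal sketch of the kind of bookkeeping shared state forces on you. The counter and the `add_many` helper are hypothetical.

```python
import threading

counter = 0
counter_lock = threading.Lock()


def add_many(times: int) -> None:
    global counter
    for _ in range(times):
        # Without the lock, the read-modify-write below could interleave
        # with another thread and lose updates.
        with counter_lock:
            counter += 1


threads = [threading.Thread(target=add_many, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 40000
```

With separate processes there is no shared `counter` to protect in the first place, which is the simplicity argument for multiprocessing.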
Everyone wants parallelism in Python. Removing the GIL isn't the only way to get it.
I'm saturating 192cpu / 1.5TBram machines with no headache and straightforward multiprocessing. I really don't see what more multithreading will bring.
What are these massive overheads / complexity / bugs you're talking about ?
[x] Async.
[x] Optional static typing.
[x] Threading.
[ ] JIT.
[ ] Efficient dependency management.
I think the real solution here is to just only use python dependency management for python things and to use something like nix for everything else.
Between pip, poetry and pyproject.toml, things are now quite good IMHO.
This is not a requirement for a language to be statically typed. Static typing is about catching type errors before the code is run.
> Type hint a var as a string then set it to an int, that code still gonna try to execute.
But it will fail type checking, no?
In static typing the types of variables don't change during execution.
So you can do things like “from typing import Optional” to bring Optional into scope, and then annotate a function with -> Optional[int] to indicate it returns None or an int.
Unlike a system using special comments for type hints, the interpreter will complain if you make a typo in the word Optional or don’t bring it into scope.
But the interpreter doesn’t do anything else; if you actually return a string from that annotated function it won’t complain.
You need an external third party tool like MyPy or Pyre to consume the hint information and produce warnings.
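A tiny sketch of that behaviour. `parse_port` is a made-up function that deliberately violates its own annotation.

```python
from typing import Optional, get_type_hints


def parse_port(text: str) -> Optional[int]:
    # Deliberately wrong: returns the string instead of an int or None.
    return text


value = parse_port("8080")

# The interpreter stores the hint but never checks the returned value.
print(type(value).__name__)
print(get_type_hints(parse_port)["return"])
```

This runs without any complaint from the interpreter. A checker such as mypy or pyright, run over the same file, is what would flag the `return text` line.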
In practice it’s quite usable, so long as you have CI enforcing the type system. You can gradually add types to an existing code base, and IDEs can use the hint information to support code navigation and error highlighting.
It would be super helpful if the interpreter had a type-enforcing mode though. All the various external runtime enforcement packages leave something to be desired.
Works pretty efficiently.
BTW, TypeScript also does not enforce types at runtime. Heck, C++ does not enforce types at runtime either. It does not mean that their static typing systems don't help at development time.
Speaking of C here as I don't have web development experience. The static type system does help, but in this case, it's the compiler doing the check at compile time to spare you many surprises at runtime. And it's part of the language's standard. Python itself doesn't do that. Good that you can use external tools, but I would prefer if this was part of Python's spec.
Edit: these days I'm thinking of having a look at Mojo, it seems to do what I would like from Python.
python -c "x: int = 'not_an_int'"
My opinion is that with PEP 695 landing in Python 3.12, the type system itself is starting to feel robust.
These days, the python ecosystem's key packages all tend to have extensive type hints.
The type checkers are of varying quality; my experience is that pyright is fast and correct, while mypy (not having the backing of a Microsoft) is slower and lags on features a little bit -- for instance, mypy still hasn't finalized support for PEP 695 syntax.
There are also multiple compilers (mypyc, nuitka, others I forget) which take advantage of types to compile python to machine code.
Just try:
$ python
>>> 1 + '3'
The other tools are trivially easy to set up and run (or let your IDE run for you.) As in, one command to install, one command to run. It's an elegant compromise that brings something that's sorely needed to Python, and users will spend more time loading the typing spec in their browser than they will installing the type checker.
There are areas where typing is more important: public interfaces. You don't have to make every piece of your program well-typed. But signatures of your public functions / methods matter a lot, and from them types of many internal things can be inferred.
If your code has a well-typed interface, it's pleasant to work with. If interfaces of the libraries you use are well-typed, you have easier time writing your code (that interacts with them). Eventually you type more and more code you write and alter, and keep reaping the benefits.
IMHO Python should shamelessly steal as much typescript’s typing as possible. It’s tough since the Microsoft typescript team is apparently amazing at what they do so for now it’s a very fast moving target but some day…
what it has is "type hints", which is a way to have richer integration with type checkers and your IDE, but will never offer more than that as is
Python is strongly typed and its interpreter is aware of the types of its variables, so you're probably overreaching with that statement. Because Python's internals are type aware, folks are able to create type checkers like mypy and pydantic, both written in Python. Maybe you're thinking about TS/JSDoc, which is just window dressing for IDEs to display hints as you described?
https://github.com/mypyc/mypyc
You can compile python to c. Right now. Compatibility with extensions still needs a bit of work. But you can write extremely strict python.
That's without getting into things like cython.
For efficient dependency management, there is now rye and UV. So maybe you can check all those boxes?
So there's plenty of well-founded hope, but the boxes are still not checked.
[X] print requires parentheses
[0] https://stackoverflow.com/questions/56262012/conda-install-t...
https://www.anaconda.com/blog/a-faster-conda-for-a-growing-c...
I regularly encounter python code which takes minutes to execute but runs in less than a second when replacing key parts with compiled code.
This is a big, fundamental, and (in many cases) breaking change, even if it's "optional".
There were a lot of smaller breaking changes over the years, especially 3.10 that probably should have been a 4.0.
I’m looking forward to seeing how people use a Python that can be meaningfully threaded. While it may take a bit to build momentum, I suspect that in a few years there’ll be obvious use cases that are widely deployed that no one today has even really considered.
So far, I've rarely seen that. Best example I deal with was a networking project with lots of communication across threads, and that one was too performance-sensitive to even use C++, let alone Py. Other things I can think of are OS programming which again has to be C or Rust.
There have been patches to remove the GIL going back to the 90s and Python 1.5 or thereabouts. But the performance impact has always been the show-stopper.
This post is a call to ask people to “kick the tires”, experiment, and report issues they run into, not announcing that all work is done.
So the net is actually a small performance win but lesser than if there was no free threading. That said, many of the techniques he identified were immediately incorporated into CPython and so I would expect benchmarks to show some regression as compared with the single threaded interpreter of the previous revision.
Meanwhile what takes the crown? - Single threaded python.
(Well, ok Rust looks like it's taking first place where you really need the speed and it does help parallelism without requiring absolute purity)
Any python library that cares about performance is written in C/C++/Rust/Fortran and only provides a python interface.
ML will have 0 benefit from this.
Is there a cibuildwheel / CI check for free-threaded Python support?
Is there already a reason not to have Platform compatibility tags for free-threaded cpython support? https://packaging.python.org/en/latest/specifications/platfo...
Is there a hame - a hashtaggable name - for this feature to help devs find resources to help add support?
Can an LLM almost port in support for free-threading in Python, and how should we expect the tests to be insufficient?
"Porting Extension Modules to Support Free-Threading" https://py-free-threading.github.io/porting/
[1] "Python 3 "Wall of Shame" Becomes "Wall of Superpowers" Today" https://news.ycombinator.com/item?id=4907755
(Edit)
Compatibility status tracking: https://py-free-threading.github.io/tracking/
python-feedstock / recipe / meta.yml: https://github.com/conda-forge/python-feedstock/blob/master/...
pypy-meta-feedstock can be installed in the same env as python-feedstock; https://github.com/conda-forge/pypy-meta-feedstock/blob/main...
sudo dnf install python3.13-freethreading
sudo add-apt-repository ppa:deadsnakes
sudo apt-get update
sudo apt-get install python3.13-nogil
conda create -n nogil -c defaults -c ad-testing/label/py313_nogil python=3.13
mamba create -n nogil -c defaults -c ad-testing/label/py313_nogil python=3.13
TODO: conda-forge ?, pixi
I'd love to see a more fluid model between the two -- E.G. if I'm doing a "gather" on CPU-bound coroutines, I'm curious if there's something that can be smart enough to JIT between async and multithreaded implementations.
"Oh, the first few tasks were entirely CPU-bound? Cool, let's launch another thread. Oh, the first few threads were I/O-bound? Cool, let's use in-thread coroutines".
Probably not feasible for a myriad of reasons, but even a more fluid programming model could be really cool (similar interfaces with a quick swap between?).
If you're serving HTTP requests, for instance, simply serving each request on its own thread with its own event loop should be sufficient at scale. Multiple requests each with CPU-bound tasks will still saturate the CPUs.
Very little code teeters between CPU-bound and io-bound while also serving few enough requests that you have cores to spare to effectively parallelize all the CPU-bound work. If that's the case, why do you need the runtime to do this for you? A simple profile would show what's holding up the event loop.
But still, the runtime can't naively parallelize coroutines. Coroutines are expected not to be run in parallel and that code isn't expected to be thread safe. Instead of a gather on futures, your code would have been using a thread pool executor in the first place if you'd gone out of your way to ensure your CPU-bound code was thread safe: the benefits of async/await are mostly lost.
I also don't think an event loop can be shared between two running threads: if you were to parallelize coroutines, those coroutines' spawned coroutines could run in parallel. If you used an async library that isn't thread safe because it expects only one coroutine is executing at a time, you could run into serious bugs.
This is exactly where I'd like to see it.
I'd like to simultaneously:
1. Call out to external APIs and not run any overhead/complexity of creating/managing threads
2. Call out to a model on a CPU and not have it block the event loop (I want it to launch a new thread and have that be similar to me)
3. Call out to a model on a GPU, ditto
And use the observed resource CPU/GPU usage to scale up nicely with an external horizontal scaling system.
So it might be that the async API is a lot easier to use/ergonomic than threads. I'd be happy to handle thread-safety (say, annotating routines), but as you pointed out, there are underlying framework assumptions that make this complicated.
The solution we always used is to separate out the CPU-bound components from the IO-bound components, even onto different servers or sidecar processes (which, effectively, turn CPU-bound into IO-bound operations). But if they could co-exist happily, I'd be very excited. Especially if they could use a similar API as async does.
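For what it's worth, part of this is already expressible with `asyncio.to_thread` (Python 3.9+). A rough sketch, where `call_external_api` and `cpu_bound` are hypothetical stand-ins for the API call and the CPU model call:

```python
import asyncio


def cpu_bound(n: int) -> int:
    # Stand-in for a model call that would otherwise block the event loop.
    return sum(i * i for i in range(n))


async def call_external_api(delay: float) -> str:
    # Stand-in for a network request; asyncio.sleep yields to the event loop.
    await asyncio.sleep(delay)
    return "api-ok"


async def main():
    # The coroutine stays on the event loop; the CPU-bound call runs in a
    # worker thread from the loop's default executor.
    return await asyncio.gather(
        call_external_api(0.05),
        asyncio.to_thread(cpu_bound, 100_000),
    )


api_result, cpu_result = asyncio.run(main())
```

Both kinds of work sit behind the same `await`/`gather` interface. On a GIL build the worker thread still competes with the event loop for the interpreter, while on a free-threaded build it can run on another core. Making `cpu_bound` thread safe remains your job either way.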
Maybe if you’ve got an embarrassingly parallel problem, and dozen(s) of cores to spare, you can match the performance of a single-threaded JIT/AOT compiled program.
However, they simply have too much code to rewrite it all in another language. Hence the attempts recently to fundamentally change Python itself to make it more suitable for large-scale codebases.
<rant>And IMO less suitable for writing small scripts, which is what the majority of Python programmers are actually doing.</rant>
It’s much worse in everything but a threaded test
-Episode 2: Removing the GIL[1]
-Episode 12: A Legit Episode[2]
[1]https://www.youtube.com/watch?v=jHOtyx3PSJQ&list=PLShJCpYUN3...
[2]https://www.youtube.com/watch?v=IGYxMsHw9iw&list=PLShJCpYUN3...
What about simple operations like incrementing an integer? IIRC this is currently thread-safe because the GIL guarantees each bytecode instruction is executed atomically.
I guess the only things that are a single instruction are some modifications to mutable objects, and those are already heavyweight enough that it’s OK to add a per-object lock.
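A quick way to check this is to disassemble an increment. This is only a sketch: the exact opcode names vary between CPython versions, and `bump`/`hits` are made-up names.

```python
import dis

hits = 0


def bump():
    global hits
    hits += 1


# List the bytecode instructions the increment compiles to.
opnames = [ins.opname for ins in dis.get_instructions(bump)]
print(opnames)
```

The listing contains a separate load (`LOAD_GLOBAL`), an add, and a store (`STORE_GLOBAL`), so `hits += 1` spans several instructions and a thread switch can land between the load and the store. In other words, the GIL alone never made a plain increment safe; a lock is needed with or without it.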
I've done quite a bit of stuff with Java and Kotlin in the past quarter century and it's interesting to see how much things have evolved. Early on there were a lot of people doing silly things with threads and overusing the, at the time, not so great language features for that. But a lot of that stuff has been replaced by better primitives and libraries.
If you look at Kotlin these days, there's very little of that silliness going on. It has no synchronized keyword. Or a volatile keyword, like Java has. But it does have co-routines and co-routine scopes. And some of those scopes may be backed by thread pools (or virtual thread pools on recent JVMs).
Now that python has async, it's probably a good idea to start thinking about some way to add structured concurrency similar to that on top of that. So, you have async stuff and some of that async stuff might happen on different threads. It's a good mental model for dealing with concurrency and parallelism. There's no need to repeat two decades of mistakes that happened in the Java world; you can fast forward to the good stuff without doing that.
Really excited about this.
With it, the single-threaded case is slower.
The link should have been to https://py-free-threading.github.io/tracking/