Go look at the source code. Look at how simple it is: anyone who has created a thread in Java knows what's happening. With only minor tweaks, your pre-existing code can take advantage of this with basically no effort. And it retains all the debuggability of traditional Java threads (i.e. a stack trace that makes sense!)
If you've spent any time at all dealing with the horrors of C# async/await (Why am I here? Oh, no idea) and its doubling of your APIs to support function colouring - or you've fought with the complexities of reactive solutions in the Java space -- often, frankly, in the name of "scalability" that will never be practically required -- this is a big deal.
You no longer have to worry about any of that.
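As a sketch of what that simplicity looks like (the class name and port are made up for the example, and it needs a Loom-enabled JDK, i.e. Java 21 or Java 19+ with preview features): a thread-per-connection echo server over the plain blocking java.net.ServerSocket API, one virtual thread per client.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.ServerSocket;
import java.net.Socket;

public class VirtualEchoServer {
    // Plain blocking echo: copies everything read back to the writer.
    static void echo(InputStream in, OutputStream out) throws IOException {
        in.transferTo(out);
    }

    public static void main(String[] args) throws IOException {
        try (ServerSocket server = new ServerSocket(9000)) { // arbitrary port
            while (true) {
                Socket socket = server.accept();
                // One cheap virtual thread per connection; blocking IO parks
                // the virtual thread instead of an OS thread.
                Thread.startVirtualThread(() -> {
                    try (socket) {
                        echo(socket.getInputStream(), socket.getOutputStream());
                    } catch (IOException ignored) {
                        // peer went away
                    }
                });
            }
        }
    }
}
```

The point is that echo() is ordinary blocking code; the virtual thread parks cheaply while it waits, and a stack trace through it still makes sense.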
If you decide somewhere deep in your program that you want to use async operations, most languages let you keep the invoking function/closure synchronous and return some kind of Promise/Future-like value.
The context switch (however small) will cause latency when this solution is at saturation.
I think they should run four tests: fibers, NIO, and each of those again with userspace networking (no kernel copying of network memory), and compare them.
Why Oracle is stalling on removing the kernel from Java networking surprises me; they already have a VM.
https://kotlinlang.org/spec/asynchronous-programming-with-co...
However... an unavoidable fact is that converted code works differently to other code. The programmer needs to know the difference. Normal and converted code compose together differently. The Kotlin compiler and type system helps keep track, but it can't paper over everything.
Having lightweight threads and continuations supported directly in the VM makes things very much simpler for programmers (and compiler writers!), since the VM can handle the details of suspending/resuming, and code composes together effortlessly even without compiler support, so it works across languages and codebases.
I don't want to be critical about Kotlin. It's amazing what it achieves and I'm a big fan of this stuff. Here are some notes I wrote on something similar, Scala's experiments with compile-time delimited continuations: https://rd.nz/2009/02/delimited-continuations-in-scala_24.ht...
I think this is a general principle about compiler features vs runtime features. Having things in the runtime makes life a lot easier for everyone, at the cost of runtime complexity, of course.
Another one I'd like to see is native support for tail calls in Java. Kotlin, Scala, etc. have to do compile-time tricks to get basic tail-call support, but it doesn't work well across functions.
Scala and Kotlin both ask the programmer to add annotations where tail calls are needed, since the code gen so often fails.
https://kotlinlang.org/docs/functions.html#tail-recursive-fu...
https://www.scala-lang.org/api/3.x/scala/annotation/tailrec....
https://rd.nz/2009/04/tail-calls-tailrec-and-trampolines.htm...
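For anyone curious what those compile-time tricks amount to, the trampoline from that last link can be sketched in plain Java: the "recursive" call returns a description of the next step, and a loop drives the steps, so the Java stack never grows. (Step/Done/More are names invented for this illustration.)

```java
import java.util.function.Supplier;

public class Trampoline {
    // A computation is either finished with a value, or has more work to do.
    sealed interface Step<T> permits Done, More {}
    record Done<T>(T value) implements Step<T> {}
    record More<T>(Supplier<Step<T>> next) implements Step<T> {}

    // Drive the "recursion" with a loop, so the stack stays flat.
    static <T> T run(Step<T> step) {
        while (step instanceof More<T> more) {
            step = more.next().get();
        }
        return ((Done<T>) step).value;
    }

    // Tail-recursive sum 1..n, written in trampolined style.
    static Step<Long> sum(long n, long acc) {
        if (n == 0) return new Done<>(acc);
        return new More<>(() -> sum(n - 1, acc + n));
    }

    public static void main(String[] args) {
        // Ten million "recursive" calls without a StackOverflowError.
        System.out.println(run(sum(10_000_000L, 0L)));
    }
}
```

This is essentially what compilers emit when they can't rely on the runtime; native tail-call elimination in the VM would make the wrapper objects and the driver loop unnecessary.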
As a side note, I can see that tail calls are planned for Project Loom too, but I haven't heard if that's implemented yet. Does anyone know the status?
"Project Loom is intended to explore, incubate and deliver Java VM features and APIs built on top of them for the purpose of supporting easy-to-use, high-throughput lightweight concurrency and new programming models on the Java platform. This is accomplished by the addition of the following constructs:
* Virtual threads
* Delimited continuations
* Tail-call elimination"
What's remarkable about this experiment is that it uses simple 26-year-old (Java 1.0) networking APIs.
We need a standardized computer for benchmarking these types of claims. I propose the RasPi 4 4GB model. Everybody can find one, all the hardware's soldered on so no cheating is really possible, etc. Then we can really shoot for efficiency.
There are limits in the Linux kernel, and the 5M concurrent connections figure was chosen to exceed them.
From what I remember (my knowledge is ancient though), a Java thread consumes a pid_t in the Linux kernel. By default this is limited to 64k; however, it can be increased by setting a kernel flag, to a maximum of 2^22, or ~4M.
In order to have more than 4m connections, the existing Java code either needs to be changed to be event driven, or it can't use kernel threads.
Event driven code is very different. It's very powerful, but it is very easy to get lost. Think writing Java code that looks like a Makefile with dependencies or "andThen" everywhere, and everyone having to make sure everything is threadsafe. Thread safety is hard for large teams with high qps services - deadlocks can bring down a service.
If a developer can write "regular" non-re-entrant Java code and still get the concurrent connections? Win all around.
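To make the contrast concrete, here's a sketch of the same (entirely made-up) two-step lookup written both ways; the back ends are stubbed with already-completed futures:

```java
import java.util.concurrent.CompletableFuture;

public class Styles {
    // Event-driven style: the control flow lives in the callback chain.
    static CompletableFuture<String> fetchAsync(String user) {
        return lookupId(user)
                .thenCompose(Styles::loadProfile)
                .thenApply(profile -> "profile: " + profile);
    }

    // Thread-per-connection style: same logic, read top to bottom.
    // On a virtual thread, the blocking join() is cheap.
    static String fetchBlocking(String user) {
        int id = lookupId(user).join();
        String profile = loadProfile(id).join();
        return "profile: " + profile;
    }

    // Hypothetical back ends, stubbed for the example.
    static CompletableFuture<Integer> lookupId(String user) {
        return CompletableFuture.completedFuture(user.hashCode());
    }
    static CompletableFuture<String> loadProfile(int id) {
        return CompletableFuture.completedFuture("user-" + id);
    }
}
```

With two steps the chained version is tolerable; with branching, loops, and error handling threaded through thenCompose/exceptionally, it turns into the "Makefile with dependencies" shape described above.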
LMAO I wish.
Is there any way for the TCP connections to share memory in kernel space? My experiment only uses two 8-byte buffers in userspace.
* I don't know if someone has created some experimental implementation somewhere. It would require a significant overhaul of the TCP implementation in the kernel.
edit: check out this sibling thread about userland TCP. I think this is a more interesting/likely direction to explore in. https://news.ycombinator.com/item?id=31215569
Otoh, FreeBSD's maximum FD limit is set as a factor of total memory pages (edit: looked it up, it's in sys/kern/subr_param.c, the limit is one FD per four pages, unless you edit kernel source) and you've got 2M pages with 8GB ram, so you would be limited to 512k FDs total, and if you're running the client on the same machine as server, that's 256k connections. But 8G is not much for a server, and some phones have more than that... so it's not super limiting.
When you're really not doing much with the connections, userland TCP, as suggested in a sibling comment, could help you squeeze in more connections, but if you're going to actually do work, you probably need more RAM.
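Spelling out the FreeBSD arithmetic above (assuming 4 KB pages and the one-FD-per-four-pages default):

```java
public class FreeBsdFdMath {
    public static void main(String[] args) {
        long ramBytes = 8L * 1024 * 1024 * 1024;  // 8 GB of RAM
        long pageSize = 4096;                     // typical page size
        long pages    = ramBytes / pageSize;      // ~2M pages
        long maxFds   = pages / 4;                // default: one FD per four pages
        long loopback = maxFds / 2;               // client+server on one box: 2 FDs/conn
        System.out.println(maxFds + " FDs max, " + loopback + " loopback connections");
    }
}
```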
Btw, as a former WhatsApp server engineer, WhatsApp listens on three ports; 80, 443, and 5222. Not that that makes a significant difference in the content.
But independent of socket buffers, the kernel obviously needs to allocate other state per socket, which tracks the state of the TCP connection.
I've used explicit context switching syscalls to "mock out" embedded real time OS task switching APIs. It's pretty fun and useful. The context switching itself may not be any faster than if the kernel does it, but the fact that it's synchronous to your program flow means that you don't have to spend any overhead synchronizing to mutexes, queues, etc. (You still have them, they just don't have to be thread safe.)
Yes.
A TCP connection state machine consists of a few variables to keep track of sequence numbers and congestion control parameters (no more than 100-200 bytes total), plus the space for send/receive buffers.
A 4 TB SSD would fit ~125 million 16-KB buffer pairs, and 125 million 256-byte structs would take up only 32 GB of memory. In theory, handling 100 million simultaneous connections on a single machine is totally doable. Of course, the per-connection throughput would be complete doodoo even with the best NICs, but it would still be a monumental yet achievable milestone.
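Spelling out that back-of-the-envelope arithmetic:

```java
public class ConnectionMath {
    public static void main(String[] args) {
        long connections = 125_000_000L;              // 125 million connections
        long bufferPair  = 2L * 16 * 1024;            // 16 KB send + 16 KB receive
        long stateStruct = 256;                       // per-connection TCP state

        long bufferBytes = connections * bufferPair;  // ~4 TB: fits on the SSD
        long stateBytes  = connections * stateStruct; // 32 GB: fits in RAM
        System.out.println(bufferBytes + " buffer bytes, " + stateBytes + " state bytes");
    }
}
```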
It's interesting to think about though, I agree. What are the next scaling bottlenecks (for JVM-compatible languages) now that threading is nearly solved?
There are some obvious ones. Others in the thread have pointed out network bandwidth. Some use cases don't need much bandwidth but do need intense routability of data between connections, like chat apps, and it seems ideal for those. Still, you're going to face other problems:
1. If that process is restarted for any reason that's a lot of clients that get disrupted. JVMs are quite good at hot-reloading code on the fly, so it's not inherently the case that this is problematic because you could make restarts very rare. But it's still a problem.
2. Your CPU may be sufficient for the steady state but on restart the clients will all try to reconnect at once. Adding jitter doesn't really solve the issue, as users will still have to wait. Handling 5M connections is great unless it takes a long time to reach that level of connectivity and you are depending on it.
3. TCP is rarely used alone now, it usually comes with SSL. Doing SSL handshakes is more expensive than setting up a TCP connection (probably!). Do you need to use something like QUIC instead? Or can you offload that to the NIC making this a non-issue? I don't know. BTW the Java SSL stack is written in Java itself so it's fully Loom compatible.
I don't think QUIC helps with that at all. Afaik, QUIC is all userland, so you'd skip kernel processing, but that doesn't really make establishment cheaper. And TCP+TLS establishes the connection before doing crypto, so that saves effort on spoofing (otoh, it increases the round trips, so pick your tradeoffs).
One nice thing about TCP though is it's trivial to determine whether packets are establishing or established: you can easily drop incoming SYNs when the CPU is saturated to put back pressure on clients. That will work well enough when crypto setup is the issue too. Operating systems will essentially do this for you if you get behind on accepting on your listen sockets. (Edit) syncookies help somewhat if your system gets overwhelmed and can't keep state for all those half-established connections, although not without tradeoffs.
In the before times, accelerator cards for TLS handshakes were common (or at least available), but I think current NIC acceleration is mainly the bulk ciphering which IMHO is more useful for sending files than sending small data that I'd expect in a large connection count machine. With file sending, having the CPU do bulk ciphers is a RAM bottleneck: the CPU needs to read the data, cipher it, and write to RAM then tell the NIC to send it; if the NIC can do the bulk cipher that's a read and write omitted. If it's chat data, the CPU probably was already processing it, so a few cycles with AES instructions to cipher it before sending it to send buffers is not very expensive.
For extremely IO-wait-bound workloads though, there were always a LOT of hoops to jump through to keep performance strong, since OS threads always have a notable stack memory footprint that just doesn't scale well when you could have thousands of OS threads waiting around just taking up RAM.
Moving 100M connections for maintenance will be a giant pain though. You would want to spend a good amount of time on a test suite so you can have confidence in the new deploys when you make them. Also, the client side of testing will probably be harder to scale than the server side... but you can do things like run 1000 test clients with 100k outgoing connections each to help with that.
Also tickled to see my erlang 1M comet blog post referenced. A lifetime ago now, pre-websockets.
IMHO it's only JVM+Graal that can bring this to other languages. Loom relies very heavily on some fairly unique aspects of the Java ecosystem (Go has these things too though). One is that lots of important bits of code are implemented in pure Java, like the IO and SSL stacks. Most languages rely heavily on FFI to C libraries. That's especially true of dynamic scripting languages but is also true of things like Rust. The Java world has more of a culture of writing their own implementations of things.
For the Loom approach to work you need:
a. Very tight and difficult integration between the compiler, threading subsystem and garbage collector.
b. The compiler/runtime to control all code being used. The moment you cross the FFI into code generated by another compiler (i.e. a native library) you have to pin the thread and the scalability degrades or is lost completely.
But! Graal has a trick up its sleeve. It can JIT compile lots of languages, and those languages can call into each other without a classical FFI. Instead the compiler sees both call site and destination site, and can inline them together to optimize as one. Moreover those languages include binary languages like LLVM bitcode and WASM. In turn that means that e.g. Python calling into a C extension can still work, because the C extension will be compiled to LLVM bitcode and then the JVM will take over from there. So there's one compiler for the entire process, even when mixing code from multiple languages. That's what Loom needs.
At least in theory. Perhaps pron will contradict me here because I have a feeling Loom also needs the invariant that there are no pointers into the stack. True for most languages but not once C gets involved. I don't know to what extent you could "fix" C programs at the compiler level to respect that invariant, even if you have LLVM bitcode. But at least the one-compiler aspect is not getting in the way.
Also, why are these not default for the O/S? What are we compromising by setting those values?
For application level, it's going to depend on how you handle concurrency. This post is interesting, because it's a benchmark of a different way to do it in Java. You could probably do 5M connections in regular Java through some explicit event loop structure; but with the Loom preview, you can do it connection per Thread. You would be unlikely to do it with connection per Thread without Loom, since Linux threads are very unlikely to scale so high (but I'd be happy to read a report showing 5M Linux threads)
However there are other reasons why a C++ applications connected to the internet might indeed die faster than a Java one.
Some back of the envelope maths: https://www.wolframalpha.com/input?i=100+Gbps+%2F+5+million
If the server had a 100 Gbps Ethernet NIC, this would leave just 20 kbps for each TCP connection.
I could imagine some IoT scenarios where this might be a useful thing, but outside of that? I doubt there's anyone that wants 20 kbps throughput in this day and age...
It's a good stress test however to squeeze out inefficiencies, super-linear scaling issues, etc...
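The WolframAlpha link boils down to one division:

```java
public class BandwidthMath {
    public static void main(String[] args) {
        long nicBitsPerSec = 100_000_000_000L; // 100 Gbps NIC
        long connections   = 5_000_000L;       // 5M concurrent connections
        long perConnBps    = nicBitsPerSec / connections;
        System.out.println(perConnBps + " bps per connection"); // 20000 bps = 20 kbps
    }
}
```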
I'm very excited about the possibilities of Loom. Would love to have a more realistic sample with Spring Boot that would demonstrate the real world scale. I saw a few but nothing remotely as ambitious as that.
It's largely a collection of the same libraries you would use anyways glued together with a custom di system.
net.netfilter.nf_conntrack_buckets = 1966050
net.netfilter.nf_conntrack_max = 7864200
or avoid conntrack entirely with

options nf_conntrack expect_hashsize=X hashsize=X

in /etc/modules.d/nf_conntrack.conf, X being 1/4 the size of conntrack_max

Or is this a test where something actually happens (data exchanges) with each connection?
I ask because those are two totally different workloads and typically where in the later test Erlang shines.
- Green threads scale somewhat better, but both scale ridiculously well, meaning probably you won't run into scaling issues.
- async/await generators use way less memory than a dedicated green thread, this affects both memory consumption and startup time, since the process has to run around asking the OS for more memory
- green threads are faster to execute
Here's the link:
https://alexyakunin.medium.com/go-vs-c-part-1-goroutines-vs-...
For those who don't understand this, Kotlin's co-routine framework is designed to be language neutral and already works on top of the major platforms that have Kotlin compilers (native, javascript, jvm, and soon wasm). So it doesn't really compete with the "native" way of doing concurrent, asynchronous, or parallel computing on any of those platforms but simply abstracts the underlying functionality.
It's actually a multi platform library that implements all the platform specific aspects in the platform appropriate way. It's also very easy to adapt existing frameworks in this space via Kotlin extension functions and the JVM implementation actually ships out of the box with such functions for most common solutions on the JVM for this (Java's threads, futures, threadpools, etc., Spring Flux, RxJava, Vert.x, etc.). Loom will be just another solution in this long list.
If you use Spring Boot with Kotlin for example, rather than dealing with Spring's Flux, you simply define your asynchronous resources as suspend functions. Spring does the rest.
With Kotlin-js in a browser you can call Promise.toCoroutine() and async { ... }.asPromise(). That makes it really easy to write asynchronous event handling in a web application, for example, or to work with javascript APIs that expect promises from Kotlin. And if you use web-compose, fritz2, or even react with kotlin-js, anything asynchronous you'd likely be dealing with via some kind of co-routine and suspend functions.
Once Loom ships, it basically will enable some nice, low level optimization to happen in the JVM implementation for co-routines and there will likely be some new extension functions to adapt the various new Java APIs for this. Not a big deal but it will probably be nice for situations with extremely large amounts of co-routines and IO. Not that it's particularly struggling there of course but all little bits help. It's not likely to require any code updates either. When the time comes, simply update your jvm and co-routine library and you should be good to go.
I won't repeat it all, but the main point is that having runtime support is much better than relying on compiler support, even if compiler support is pretty fantastic.
Note that the two aren't mutually exclusive; you should still be able to use coroutines after Project Loom ships, and they still might make sense in many places.
2. There is no need for a split world of APIs, some designed for threads and others for coroutines (so-called "function colouring"). Existing APIs, third-party libraries, and programs — even those dating back to Java 1.0 (just as this experiment does with Java 1.0's java.net.ServerSocket) — just work on millions of virtual threads.
Normally, you wouldn't even call Thread.startVirtualThread(), but just replace your platform-thread-pool-based ExecutorService with an ExecutorService that spawns a new virtual thread for each task (Executors.newVirtualThreadPerTaskExecutor()). For more details, see the JEP: https://openjdk.java.net/jeps/425
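A minimal sketch of that executor-based shape (port number arbitrary; requires Java 21 or a Loom preview build):

```java
import java.net.ServerSocket;
import java.net.Socket;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class VirtualExecutorServer {
    public static void main(String[] args) throws Exception {
        // One virtual thread per submitted task; no pool sizing to tune.
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor();
             ServerSocket server = new ServerSocket(8080)) {
            while (true) {
                Socket socket = server.accept();
                executor.submit(() -> handle(socket)); // plain blocking code inside
            }
        }
    }

    static void handle(Socket socket) {
        // ordinary blocking read/write, as with a platform thread pool
    }
}
```

The only change from a classic thread-pool server is the Executors factory method; the per-task code stays blocking and sequential.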
It's almost a little disappointing that beefy modern servers only manage a 5x scale improvement, though that could be due to differences in runtime behaviour between Erlang and the JVM.
That's a very cool and a noble pursuit. But the title of this article might as well have been "5M persistent connections with Linux" because that's where the magic 5M connections happen.
I could also attempt 5M connections at the Java level using Netty and asynchronous IO - no threads or Loom. Again, it'd take more Linux configuration than anything else. If that configuration did happen though now you can also do it in C# async/await, javascript, I'm sure Erlang and anything else that does Asynchronous I/O whether it's masked by something like Loom/Async/Await or not.
So while you could achieve 5M in other ways, those ways would not only be more complex, but also not really observable/debuggable by Java platform tools.
Writing the sort of applications that I get involved with, it's frequently the case that, whilst 1 OS thread per Java thread was a theoretical scalability limitation, in practice we were never likely to hit it (and there was always 'get a bigger computer').
But: the complexity mavens inside our company, and the projects we rely upon, get bitten by an obsessive need to chase 'scalability' /at all costs/. Which is fine, but the downside is that the negative consequences of coloured functions come into play. We end up suffering, having to deal with vert.x or kotlin or whatever the flavour-of-the-month solution is, which is /inherently/ harder to reason about than a linear piece of code. If you're in a C# project and you get a library that's async: boom, game over.
If loom gets even within performance shouting distance of those other models, it ought to kill reactive programming in the java space dead (for all but the edgiest of edge-cases). You might be able to make a case - obviously depending on your use cases, which are not mine - that extracting, say, 50% more scalability is worth the downsides. If that number is, say, 5%, then for the vast majority of projects the answer is going to be 'no'.
I say 'ought to', as I fear the adage that "developers love complexity the way moths love flames - and often with the same results". I see both engineers and projects (Hibernate and keycloak, IIRC) have a great deal of themselves invested in their Rx position, and I already sense that they're not going to give it up without a fight.
So: the headline number is less important than "for virtually everyone you will no longer have to trade simplicity for scalability". I can't wait!
I still attest, though: the 5M connections in this example is still a red herring.
Can we get to 6M? Can we get to 10M? Is that a question for Loom or Java's asynchronous IO system? No - it's a question for the operating system.
Loom and Java NIO can handle probably a billion connections as programmed. Java Threads cannot - although that too is a broken statement. "Linux Threads cannot" is the real statement. You can't have that many for resource reasons. Java Threads are just a thin abstraction on top of that.
Linux out of the box can't do 5M connections (last I checked). It takes Linux tuning artistry to get it there.
Don't get me wrong - I think Loom is cool. It's attempting to do the same thing Async/Await tried - just better. But it is most definitely not the only way to achieve 5M connections with Java or anything else. Possibly, however, it's the most friendly and intuitive way to do it.
*We typically vilify Java Threads for the RAM they consume. Something like 1MB per thread or so (tunable). Loom must still use "some" RAM per connection, although surely far, far less (and of course Linux must use some amount of kernel RAM per connection too).
Having run production services that had over 250,000 sockets connecting to a single server port, I'm calling "nope" on that.
Are you thinking of the ephemeral port limit? That's on the client side; not the server side. Each TCP socket pair is a four-tuple of [server IP, server port, client IP, client port]; the uniqueness comes from the client IP/port part in the server case.
The real problem with such a setup is that you're not left with a whole lot of bandwidth per connection, even if you ignore things like packet loss and retransmits mucking up the connections. Most VPS servers have a 1gbps connection; with 5 million clients that leaves 200 bits (25 bytes) per second per connection for TCP signaling and data to flow through. You'll need a ridiculous network card for a single server to deal with a real load, in the terabits per second range.
Cloudflare has some interesting blog posts on this topic:
- https://blog.cloudflare.com/how-we-built-spectrum/
- https://blog.cloudflare.com/how-to-stop-running-out-of-ephem...
If you suppose just one open server port, you'll probably need 77 client IPs to do this test and get unique socket pairs.
But it’s a client problem, not a server one.
Clients can connect to the server on the same server port, so connection limit is more like 64k*2 for every Client IP-Server IP pair.
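Spelling out the four-tuple arithmetic behind the "77 client IPs" estimate upthread (assuming at most 65,535 usable source ports per client IP and a single server port):

```java
public class TuplesMath {
    public static void main(String[] args) {
        long targetConnections = 5_000_000L;
        long portsPerClientIp  = 65_535; // one connection per source port, per client IP
        // Ceiling division: how many distinct client IPs are needed.
        long clientIpsNeeded = (targetConnections + portsPerClientIp - 1) / portsPerClientIp;
        System.out.println(clientIpsNeeded + " client IPs needed"); // 77
    }
}
```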
Time has shown that bare threads are not a viable high-level API for managing concurrency. As it turns out, we humans don't think in terms of locks and condvars but "to do X, I first need to know Y". That maps perfectly onto futures(/promises). And once you have those, you don't need all the extra complexity and hacks that green threads (/"colourless async") bring in.
I'd take a system that combined the API of futures with the performance of OS threads over the opposite combination, any day of the week. But as it turns out, we don't have to choose. We can have the performance of futures with the API of futures.
Or we can waste person-years chasing mirages, I guess. I just hope I won't get stuck having to use the end product of this.
Especially when that future scheduler already exists and works, and the preemptive one is a multi-year research project away.
Go is just yet another implementation of green threads that is slightly less broken than prior implementations, because it had the benefit of being implemented on day 1 (so the whole ecosystem is green thread-aware). It's certainly nowhere near "best-in-class".
Threads don't require locks and condvars. You can use channels and scoped joins etc. if you want.
Give me some async code and I'll show you an easier threaded version.
I don't find myself missing out on futures in Go.