Go look at the source code. Look at how simple it is: anyone who has created a thread in Java knows what's happening. With only minor tweaks, your pre-existing code can take advantage of this with basically no effort. And it retains all the debuggability of traditional Java threads (i.e. a stack trace that makes sense!)
If you've spent any time at all dealing with the horrors of C# async/await (Why am I here? Oh, no idea) and its doubling of your APIs to support function colouring - or you've fought with the complexities of reactive solutions in the Java space -- often, frankly, in the name of "scalability" that will never be practically required -- this is a big deal.
You no longer have to worry about any of that.
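As a sketch of what that simplicity looks like (the class name and port are made up for the example, and it needs a Loom-enabled JDK, i.e. Java 21 or Java 19+ with preview features): a thread-per-connection echo server over the plain blocking java.net.ServerSocket API, one virtual thread per client.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.ServerSocket;
import java.net.Socket;

public class VirtualEchoServer {
    // Plain blocking echo: copies everything read back to the writer.
    static void echo(InputStream in, OutputStream out) throws IOException {
        in.transferTo(out);
    }

    public static void main(String[] args) throws IOException {
        try (ServerSocket server = new ServerSocket(9000)) { // arbitrary port
            while (true) {
                Socket socket = server.accept();
                // One cheap virtual thread per connection; blocking IO parks
                // the virtual thread instead of an OS thread.
                Thread.startVirtualThread(() -> {
                    try (socket) {
                        echo(socket.getInputStream(), socket.getOutputStream());
                    } catch (IOException ignored) {
                        // peer went away
                    }
                });
            }
        }
    }
}
```

The point is that echo() is ordinary blocking code; the virtual thread parks cheaply while it waits, and a stack trace through it still makes sense.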
If you decide somewhere deep in your program that you want to use async operations, most languages let you keep the invoking function/closure synchronous and return some kind of Promise/Future-like value.
The context switch (however small) will cause latency when this solution is at saturation.
I think they should run four tests: fibers, NIO, and each of those again with userspace networking (no kernel copying of network memory), and compare them.
Why Oracle is stalling on removing the kernel from Java networking surprises me; they already have a VM.
https://kotlinlang.org/spec/asynchronous-programming-with-co...
However... an unavoidable fact is that converted code works differently to other code. The programmer needs to know the difference. Normal and converted code compose together differently. The Kotlin compiler and type system helps keep track, but it can't paper over everything.
Having lightweight threads and continuations supported directly in the VM makes things very much simpler for programmers (and compiler writers!), since the VM can handle the details of suspending/resuming, and code composes together effortlessly even without compiler support, so it works across languages and codebases.
I don't want to be critical about Kotlin. It's amazing what it achieves and I'm a big fan of this stuff. Here are some notes I wrote on something similar, Scala's experiments with compile-time delimited continuations: https://rd.nz/2009/02/delimited-continuations-in-scala_24.ht...
I think this is a general principle about compiler features vs runtime features. Having things in the runtime makes life a lot easier for everyone, at the cost of runtime complexity, of course.
Another one I'd like to see is native support for tail calls in Java. Kotlin, Scala, etc. have to do compile-time tricks to get basic tail-call support, but it doesn't work well across functions.
Scala and Kotlin both ask the programmer to add annotations where tail calls are needed, since the code gen so often fails.
https://kotlinlang.org/docs/functions.html#tail-recursive-fu...
https://www.scala-lang.org/api/3.x/scala/annotation/tailrec....
https://rd.nz/2009/04/tail-calls-tailrec-and-trampolines.htm...
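For anyone curious what those compile-time tricks amount to, the trampoline from that last link can be sketched in plain Java: the "recursive" call returns a description of the next step, and a loop drives the steps, so the Java stack never grows. (Step/Done/More are names invented for this illustration.)

```java
import java.util.function.Supplier;

public class Trampoline {
    // A computation is either finished with a value, or has more work to do.
    sealed interface Step<T> permits Done, More {}
    record Done<T>(T value) implements Step<T> {}
    record More<T>(Supplier<Step<T>> next) implements Step<T> {}

    // Drive the "recursion" with a loop, so the stack stays flat.
    static <T> T run(Step<T> step) {
        while (step instanceof More<T> more) {
            step = more.next().get();
        }
        return ((Done<T>) step).value;
    }

    // Tail-recursive sum 1..n, written in trampolined style.
    static Step<Long> sum(long n, long acc) {
        if (n == 0) return new Done<>(acc);
        return new More<>(() -> sum(n - 1, acc + n));
    }

    public static void main(String[] args) {
        // Ten million "recursive" calls without a StackOverflowError.
        System.out.println(run(sum(10_000_000L, 0L)));
    }
}
```

This is essentially what compilers emit when they can't rely on the runtime; native tail-call elimination in the VM would make the wrapper objects and the driver loop unnecessary.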
As a side note, I can see that tail calls are planned for Project Loom too, but I haven't heard if that's implemented yet. Does anyone know the status?
"Project Loom is intended to explore, incubate and deliver Java VM features and APIs built on top of them for the purpose of supporting easy-to-use, high-throughput lightweight concurrency and new programming models on the Java platform. This is accomplished by the addition of the following constructs:
* Virtual threads
* Delimited continuations
* Tail-call elimination"
What's remarkable about this experiment is that it uses simple 26-year-old (Java 1.0) networking APIs.
We need a standardized computer for benchmarking these types of claims. I propose the RasPi 4 4GB model. Everybody can find one, all the hardware's soldered on so no cheating is really possible, etc. Then we can really shoot for efficiency.
There are limits in the Linux kernel, and the 5M concurrent connections figure was chosen to exceed them.
From what I remember (my knowledge is ancient though), a Java thread consumes a pid_t in the Linux kernel. By default this is limited to 64k; however, it can be increased by setting a kernel flag, to a maximum of 2^22, or ~4M.
In order to have more than 4m connections, the existing Java code either needs to be changed to be event driven, or it can't use kernel threads.
Event driven code is very different. It's very powerful, but it is very easy to get lost. Think writing Java code that looks like a Makefile with dependencies or "andThen" everywhere, and everyone having to make sure everything is threadsafe. Thread safety is hard for large teams with high qps services - deadlocks can bring down a service.
If a developer can write "regular" non-re-entrant Java code and still get the concurrent connections? Win all around.
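To make the contrast concrete, here's a sketch of the same (entirely made-up) two-step lookup written both ways; the back ends are stubbed with already-completed futures:

```java
import java.util.concurrent.CompletableFuture;

public class Styles {
    // Event-driven style: the control flow lives in the callback chain.
    static CompletableFuture<String> fetchAsync(String user) {
        return lookupId(user)
                .thenCompose(Styles::loadProfile)
                .thenApply(profile -> "profile: " + profile);
    }

    // Thread-per-connection style: same logic, read top to bottom.
    // On a virtual thread, the blocking join() is cheap.
    static String fetchBlocking(String user) {
        int id = lookupId(user).join();
        String profile = loadProfile(id).join();
        return "profile: " + profile;
    }

    // Hypothetical back ends, stubbed for the example.
    static CompletableFuture<Integer> lookupId(String user) {
        return CompletableFuture.completedFuture(user.hashCode());
    }
    static CompletableFuture<String> loadProfile(int id) {
        return CompletableFuture.completedFuture("user-" + id);
    }
}
```

With two steps the chained version is tolerable; with branching, loops, and error handling threaded through thenCompose/exceptionally, it turns into the "Makefile with dependencies" shape described above.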
LMAO I wish.
Is there any way for the TCP connections to share memory in kernel space? My experiment only uses two 8-byte buffers in userspace.
* I don't know if someone has created some experimental implementation somewhere. It would require a significant overhaul of the TCP implementation in the kernel.
edit: check out this sibling thread about userland TCP. I think this is a more interesting/likely direction to explore in. https://news.ycombinator.com/item?id=31215569
Otoh, FreeBSD's maximum FD limit is set as a factor of total memory pages (edit: looked it up, it's in sys/kern/subr_param.c, the limit is one FD per four pages, unless you edit kernel source) and you've got 2M pages with 8GB ram, so you would be limited to 512k FDs total, and if you're running the client on the same machine as server, that's 256k connections. But 8G is not much for a server, and some phones have more than that... so it's not super limiting.
When you're really not doing much with the connections, userland TCP, as suggested in a sibling comment, could help you squeeze in more connections, but if you're going to actually do work, you probably need more RAM.
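Spelling out the FreeBSD arithmetic above (assuming 4 KB pages and the one-FD-per-four-pages default):

```java
public class FreeBsdFdMath {
    public static void main(String[] args) {
        long ramBytes = 8L * 1024 * 1024 * 1024;  // 8 GB of RAM
        long pageSize = 4096;                     // typical page size
        long pages    = ramBytes / pageSize;      // ~2M pages
        long maxFds   = pages / 4;                // default: one FD per four pages
        long loopback = maxFds / 2;               // client+server on one box: 2 FDs/conn
        System.out.println(maxFds + " FDs max, " + loopback + " loopback connections");
    }
}
```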
Btw, as a former WhatsApp server engineer, WhatsApp listens on three ports; 80, 443, and 5222. Not that that makes a significant difference in the content.
But independent of socket buffers, the kernel obviously needs to allocate other state per socket, which tracks the state of the TCP connection.
I've used explicit context switching syscalls to "mock out" embedded real time OS task switching APIs. It's pretty fun and useful. The context switching itself may not be any faster than if the kernel does it, but the fact that it's synchronous to your program flow means that you don't have to spend any overhead synchronizing to mutexes, queues, etc. (You still have them, they just don't have to be thread safe.)
Yes.
A TCP connection state machine consists of a few variables to keep track of sequence numbers and congestion control parameters (no more than 100-200 bytes total), plus the space for send/receive buffers.
A 4 TB SSD would fit ~125 million 16-KB buffer pairs, and 125 million 256-byte structs would take up only 32 GB of memory. In theory, handling 100 million simultaneous connections on a single machine is totally doable. Of course, the per-connection throughput would be complete doodoo even with the best NICs, but it would still be a monumental yet achievable milestone.
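Spelling out that back-of-the-envelope arithmetic:

```java
public class ConnectionMath {
    public static void main(String[] args) {
        long connections = 125_000_000L;              // 125 million connections
        long bufferPair  = 2L * 16 * 1024;            // 16 KB send + 16 KB receive
        long stateStruct = 256;                       // per-connection TCP state

        long bufferBytes = connections * bufferPair;  // ~4 TB: fits on the SSD
        long stateBytes  = connections * stateStruct; // 32 GB: fits in RAM
        System.out.println(bufferBytes + " buffer bytes, " + stateBytes + " state bytes");
    }
}
```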
It's interesting to think about though, I agree. What are the next scaling bottlenecks (for JVM-compatible languages) now that threading is nearly solved?
There are some obvious ones. Others in the thread have pointed out network bandwidth. Some use cases don't need much bandwidth but do need intense routability of data between connections, like chat apps, and it seems ideal for those. Still, you're going to face other problems:
1. If that process is restarted for any reason that's a lot of clients that get disrupted. JVMs are quite good at hot-reloading code on the fly, so it's not inherently the case that this is problematic because you could make restarts very rare. But it's still a problem.
2. Your CPU may be sufficient for the steady state but on restart the clients will all try to reconnect at once. Adding jitter doesn't really solve the issue, as users will still have to wait. Handling 5M connections is great unless it takes a long time to reach that level of connectivity and you are depending on it.
3. TCP is rarely used alone now, it usually comes with SSL. Doing SSL handshakes is more expensive than setting up a TCP connection (probably!). Do you need to use something like QUIC instead? Or can you offload that to the NIC making this a non-issue? I don't know. BTW the Java SSL stack is written in Java itself so it's fully Loom compatible.
I don't think QUIC helps with that at all. Afaik, QUIC is all userland, so you'd skip kernel processing, but that doesn't really make establishment cheaper. And TCP+TLS establishes the connection before doing crypto, so that saves effort on spoofing (otoh, it increases the round trips, so pick your tradeoffs).
One nice thing about TCP though is it's trivial to determine whether packets are establishing or established: you can easily drop incoming SYNs when the CPU is saturated to put back pressure on clients. That will work well enough when crypto setup is the issue too. Operating systems will essentially do this for you if you get behind on accepting on your listen sockets. (Edit) syncookies help somewhat if your system gets overwhelmed and can't keep state for all those half-established connections, although not without tradeoffs.
In the before times, accelerator cards for TLS handshakes were common (or at least available), but I think current NIC acceleration is mainly the bulk ciphering which IMHO is more useful for sending files than sending small data that I'd expect in a large connection count machine. With file sending, having the CPU do bulk ciphers is a RAM bottleneck: the CPU needs to read the data, cipher it, and write to RAM then tell the NIC to send it; if the NIC can do the bulk cipher that's a read and write omitted. If it's chat data, the CPU probably was already processing it, so a few cycles with AES instructions to cipher it before sending it to send buffers is not very expensive.
For extremely IO-wait-bound workloads though, there were always a LOT of hoops to jump through to keep performance strong, since OS threads always have a notable stack memory footprint that just doesn't scale well when you could have thousands of OS threads waiting around just taking up RAM.
Moving 100M connections for maintenance will be a giant pain though. You would want to spend a good amount of time on a test suite so you can have confidence in the new deploys when you make them. Also, the client side of testing will probably be harder to scale than the server side... but you can do things like run 1000 test clients with 100k outgoing connections each to help with that.
Also tickled to see my erlang 1M comet blog post referenced. A lifetime ago now, pre-websockets.
IMHO it's only JVM+Graal that can bring this to other languages. Loom relies very heavily on some fairly unique aspects of the Java ecosystem (Go has these things too though). One is that lots of important bits of code are implemented in pure Java, like the IO and SSL stacks. Most languages rely heavily on FFI to C libraries. That's especially true of dynamic scripting languages but is also true of things like Rust. The Java world has more of a culture of writing their own implementations of things.
For the Loom approach to work you need:
a. Very tight and difficult integration between the compiler, threading subsystem and garbage collector.
b. The compiler/runtime to control all code being used. The moment you cross the FFI into code generated by another compiler (i.e. a native library) you have to pin the thread and the scalability degrades or is lost completely.
But! Graal has a trick up its sleeve. It can JIT compile lots of languages, and those languages can call into each other without a classical FFI. Instead the compiler sees both call site and destination site, and can inline them together to optimize as one. Moreover those languages include binary languages like LLVM bitcode and WASM. In turn that means that e.g. Python calling into a C extension can still work, because the C extension will be compiled to LLVM bitcode and then the JVM will take over from there. So there's one compiler for the entire process, even when mixing code from multiple languages. That's what Loom needs.
At least in theory. Perhaps pron will contradict me here because I have a feeling Loom also needs the invariant that there are no pointers into the stack. True for most languages but not once C gets involved. I don't know to what extent you could "fix" C programs at the compiler level to respect that invariant, even if you have LLVM bitcode. But at least the one-compiler aspect is not getting in the way.
Also, why are these not default for the O/S? What are we compromising by setting those values?
For application level, it's going to depend on how you handle concurrency. This post is interesting, because it's a benchmark of a different way to do it in Java. You could probably do 5M connections in regular Java through some explicit event loop structure; but with the Loom preview, you can do it connection per Thread. You would be unlikely to do it with connection per Thread without Loom, since Linux threads are very unlikely to scale so high (but I'd be happy to read a report showing 5M Linux threads)
However there are other reasons why a C++ applications connected to the internet might indeed die faster than a Java one.
Some back of the envelope maths: https://www.wolframalpha.com/input?i=100+Gbps+%2F+5+million
If the server had a 100 Gbps Ethernet NIC, this would leave just 20 kbps for each TCP connection.
I could imagine some IoT scenarios where this might be a useful thing, but outside of that? I doubt there's anyone that wants 20 kbps throughput in this day and age...
It's a good stress test however to squeeze out inefficiencies, super-linear scaling issues, etc...
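The WolframAlpha link boils down to one division:

```java
public class BandwidthMath {
    public static void main(String[] args) {
        long nicBitsPerSec = 100_000_000_000L; // 100 Gbps NIC
        long connections   = 5_000_000L;       // 5M concurrent connections
        long perConnBps    = nicBitsPerSec / connections;
        System.out.println(perConnBps + " bps per connection"); // 20000 bps = 20 kbps
    }
}
```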
I'm very excited about the possibilities of Loom. Would love to have a more realistic sample with Spring Boot that would demonstrate the real world scale. I saw a few but nothing remotely as ambitious as that.
It's largely a collection of the same libraries you would use anyways glued together with a custom di system.
net.netfilter.nf_conntrack_buckets = 1966050
net.netfilter.nf_conntrack_max = 7864200
or avoid conntrack entirely with

options nf_conntrack expect_hashsize=X hashsize=X

in /etc/modules.d/nf_conntrack.conf, X being 1/4 the size of conntrack_max

Or is this a test where something actually happens (data exchanges) with each connection?
I ask because those are two totally different workloads and typically where in the later test Erlang shines.
- Green threads scale somewhat better, but both scale ridiculously well, meaning probably you won't run into scaling issues.
- async/await generators use way less memory than a dedicated green thread, this affects both memory consumption and startup time, since the process has to run around asking the OS for more memory
- green threads are faster to execute
Here's the link:
https://alexyakunin.medium.com/go-vs-c-part-1-goroutines-vs-...
For those who don't understand this, Kotlin's co-routine framework is designed to be language neutral and already works on top of the major platforms that have Kotlin compilers (native, javascript, jvm, and soon wasm). So it doesn't really compete with the "native" way of doing concurrent, asynchronous, or parallel computing on any of those platforms but simply abstracts the underlying functionality.
It's actually a multi platform library that implements all the platform specific aspects in the platform appropriate way. It's also very easy to adapt existing frameworks in this space via Kotlin extension functions and the JVM implementation actually ships out of the box with such functions for most common solutions on the JVM for this (Java's threads, futures, threadpools, etc., Spring Flux, RxJava, Vert.x, etc.). Loom will be just another solution in this long list.
If you use Spring Boot with Kotlin for example, rather than dealing with Spring's Flux, you simply define your asynchronous resources as suspend functions. Spring does the rest.
With Kotlin-js in a browser you can call Promise.toCoroutine() and async { ... }.asPromise(). That makes it really easy to write asynchronous event handling in a web application, for example, or to work with javascript APIs that expect promises from Kotlin. And if you use web-compose, fritz2, or even react with kotlin-js, anything asynchronous you'd likely be dealing with via some kind of co-routine and suspend functions.
Once Loom ships, it basically will enable some nice, low level optimization to happen in the JVM implementation for co-routines and there will likely be some new extension functions to adapt the various new Java APIs for this. Not a big deal but it will probably be nice for situations with extremely large amounts of co-routines and IO. Not that it's particularly struggling there of course but all little bits help. It's not likely to require any code updates either. When the time comes, simply update your jvm and co-routine library and you should be good to go.
I won't repeat it all, but the main point is that having runtime support is much better than relying on compiler support, even if compiler support is pretty fantastic.
Note that the two aren't mutually exclusive; you should still be able to use coroutines after Project Loom ships, and they still might make sense in many places.
2. There is no need for a split world of APIs, some designed for threads and others for coroutines (so-called "function colouring"). Existing APIs, third-party libraries, and programs — even those dating back to Java 1.0 (just as this experiment does with Java 1.0's java.net.ServerSocket) — just work on millions of virtual threads.
Normally, you wouldn't even call Thread.startVirtualThread(), but just replace your platform-thread-pool-based ExecutorService with an ExecutorService that spawns a new virtual thread for each task (Executors.newVirtualThreadPerTaskExecutor()). For more details, see the JEP: https://openjdk.java.net/jeps/425
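A minimal sketch of that executor-based shape (port number arbitrary; requires Java 21 or a Loom preview build):

```java
import java.net.ServerSocket;
import java.net.Socket;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class VirtualExecutorServer {
    public static void main(String[] args) throws Exception {
        // One virtual thread per submitted task; no pool sizing to tune.
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor();
             ServerSocket server = new ServerSocket(8080)) {
            while (true) {
                Socket socket = server.accept();
                executor.submit(() -> handle(socket)); // plain blocking code inside
            }
        }
    }

    static void handle(Socket socket) {
        // ordinary blocking read/write, as with a platform thread pool
    }
}
```

The only change from a classic thread-pool server is the Executors factory method; the per-task code stays blocking and sequential.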
It's almost a little disappointing that beefy modern servers only manage a 5x scale improvement, though that could be due to differences in runtime behaviour between Erlang and the JVM.
That's a very cool and a noble pursuit. But the title of this article might as well have been "5M persistent connections with Linux" because that's where the magic 5M connections happen.
I could also attempt 5M connections at the Java level using Netty and asynchronous IO - no threads or Loom. Again, it'd take more Linux configuration than anything else. If that configuration did happen though now you can also do it in C# async/await, javascript, I'm sure Erlang and anything else that does Asynchronous I/O whether it's masked by something like Loom/Async/Await or not.
So while you could achieve 5M in other ways, those ways would not only be more complex, but also not really observable/debuggable by Java platform tools.
Writing the sort of applications that I get involved with, it's frequently the case that, whilst 1 OS thread per Java thread was a theoretical scalability limitation, in practice we were never likely to hit it (and there was always 'get a bigger computer').
But: the complexity mavens inside our company, and the projects we rely upon, get bitten by an obsessive need to chase 'scalability' /at all costs/. Which is fine, but the downside is that the negative consequences of coloured functions come into play. We end up suffering, having to deal with vert.x or kotlin or whatever the flavour-of-the-month solution is, which is /inherently/ harder to reason about than a linear piece of code. If you're in a C# project and you get a library that's async: boom, game over.
If loom gets even within performance shouting distance of those other models, it ought to kill reactive programming in the java space dead (for all but the edgiest of edge-cases). You might be able to make a case - obviously depending on your use cases, which are not mine - that extracting, say, 50% more scalability is worth the downsides. If that number is, say, 5%, then for the vast majority of projects the answer is going to be 'no'.
I say 'ought to', as I fear the adage that "developers love complexity the way moths love flames - and often with the same results". I see both engineers and projects (Hibernate and keycloak, IIRC) have a great deal of themselves invested in their Rx position, and I already sense that they're not going to give it up without a fight.
So: the headline number is less important than "for virtually everyone you will no longer have to trade simplicity for scalability". I can't wait!
I still attest, though: the 5M connections in this example is still a red herring.
Can we get to 6M? Can we get to 10M? Is that a question for Loom or Java's asynchronous IO system? No - it's a question for the operating system.
Loom and Java NIO can handle probably a billion connections as programmed. Java Threads cannot - although that too is a broken statement. "Linux Threads cannot" is the real statement. You can't have that many for resource reasons. Java Threads are just a thin abstraction on top of that.
Linux out of the box can't do 5M connections (last I checked). It takes Linux tuning artistry to get it there.
Don't get me wrong - I think Loom is cool. It's attempting to do the same thing Async/Await tried - just better. But it is most definitely not the only way to achieve 5M connections with Java or anything else. Possibly, however, it's the most friendly and intuitive way to do it.
*We typically vilify Java Threads for the RAM they consume. Something like 1MB per thread or so (tunable). Loom must still use "some" RAM per connection, although surely far, far less (and of course Linux must use some amount of kernel RAM per connection too).
Having run production services that had over 250,000 sockets connecting to a single server port, I'm calling "nope" on that.
Are you thinking of the ephemeral port limit? That's on the client side; not the server side. Each TCP socket pair is a four-tuple of [server IP, server port, client IP, client port]; the uniqueness comes from the client IP/port part in the server case.
The real problem with such a setup is that you're not left with a whole lot of bandwidth per connection, even if you ignore things like packet loss and retransmits mucking up the connections. Most VPS servers have a 1gbps connection; with 5 million clients that leaves 200 bits (25 bytes) per second per connection for TCP signaling and data to flow through. You'll need a ridiculous network card for a single server to deal with a real load, in the terabits per second range.
Cloudflare has some interesting blog posts on this topic:
- https://blog.cloudflare.com/how-we-built-spectrum/
- https://blog.cloudflare.com/how-to-stop-running-out-of-ephem...
If you suppose just one open server port, you'll probably need 77 client IPs to do this test and get unique socket pairs.
But it’s a client problem, not a server one.
Clients can connect to the server on the same server port, so connection limit is more like 64k*2 for every Client IP-Server IP pair.
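Spelling out the four-tuple arithmetic behind the "77 client IPs" estimate upthread (assuming at most 65,535 usable source ports per client IP and a single server port):

```java
public class TuplesMath {
    public static void main(String[] args) {
        long targetConnections = 5_000_000L;
        long portsPerClientIp  = 65_535; // one connection per source port, per client IP
        // Ceiling division: how many distinct client IPs are needed.
        long clientIpsNeeded = (targetConnections + portsPerClientIp - 1) / portsPerClientIp;
        System.out.println(clientIpsNeeded + " client IPs needed"); // 77
    }
}
```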
Time has shown that bare threads are not a viable high-level API for managing concurrency. As it turns out, we humans don't think in terms of locks and condvars but "to do X, I first need to know Y". That maps perfectly onto futures(/promises). And once you have those, you don't need all the extra complexity and hacks that green threads (/"colourless async") bring in.
I'd take a system that combined the API of futures with the performance of OS threads over the opposite combination, any day of the week. But as it turns out, we don't have to choose. We can have the performance of futures with the API of futures.
Or we can waste person-years chasing mirages, I guess. I just hope I won't get stuck having to use the end product of this.
Especially when that future scheduler already exists and works, and the preemptive one is a multi-year research project away.
Go is just yet another implementation of green threads that is slightly less broken than prior implementations, because it had the benefit of being implemented on day 1 (so the whole ecosystem is green thread-aware). It's certainly nowhere near "best-in-class".
Threads don't require locks and condvars. You can use channels and scoped joins etc. if you want.
Give me some async code and I'll show you an easier threaded version.
I don't find myself missing out on futures in Go.