I spend most of my time writing services, and when looking at a language like Go, there's a reason the default is pass-by-value. Passing pointers around between multiple goroutines is asking for a race condition, and in non-GC languages, for dangling pointers as well. Services are rarely written with the precision of high-performance servers.
I don't envy the server writers (although I can see how it would be fun!). Giving up a few milliseconds per request to make sure my goroutines aren't sharing pointers is worthwhile, and I appreciate the safety that gives me. I'm sure someone will mention that Rust could give me the safety I was looking for in a non-GC language, but that's the point, isn't it - that by being able to game the system, you can gain a few precious microseconds here and there that enforced safety might cost you.
> I'm sure someone will mention that Rust could give me
> the safety I was looking for in a non-GC language, but
> that's the point, isn't it
I'm not sure what you're trying to say here. There is no need to game the system, as Rust lets you pass pointers between threads while enforcing that they are used safely. Rust lets you be as fast as an unsafe language while as safe as a managed language; the tradeoff is that you have to give more thought to your design up-front (which, given the pain of trying to reproduce and fix race conditions, is more than worth the effort IMO).
> Alternatively, we could just yield the coro, waiting for a
> new special coroutine that would run in a background thread
> just executing readahead() and then re-scheduling (making
> runnable) the original coro, waiting for readahead.
Seems to me this scheme will ultimately be limited by the slowest requests. Mixing fast and slow operations is essentially Little's Law[1] territory, where the average time dominates. However, if the slow/blocking reads were also async, I think you'd eventually be limited by IO speed?
The main goal of this system is to minimize the potential for blocking or otherwise slow processing steps that can stall the processing of other requests, especially when this happens for many requests per second.
Specifically, the 'optimistic sequential execution' idea is that you run the request on the thread it was accepted in, use mincore() to determine whether it will _likely_ stall if you read from it, and if so, either migrate the coro to a 'slow' coros thread list, or just keep going (which should be the fast path here).
This is a solution to a specific class of problems. The main similarity with Erlang's process scheduling is that a process is a lot like a coroutine: it encapsulates logic and data, it's scheduled by a user-space scheduler, there can be thousands of them in a runnable state, and each has its own stack. But that's really the definition of stackful coroutines.
Even if a request is completely CPU-bound, it’s a good idea to be fair to other requests accepted in the thread; that is, if a request should take 2 seconds of processing time, while others would take a few microseconds, they shouldn’t all wait until that one request is processed.
Instead, yielding back to the scheduler where appropriate will give them a chance to complete in time proportional to the effort required to process them, not to the time required to handle long-running requests that stall the processing of other coros.