I'm hopeful that eventually promises and async/await fade into history as a fad that turned out to be too unwieldy. I think that lightweight processes with no shared memory, connected by streams (the Erlang/Elixer, Go and Actor model) are the way to go. The advantage to using async/await seems to be to avoid the dynamic stack allocation of coroutines, which can probably be optimized away anyway. So I don't see a strong enough advantage in moving from blocking to nonblocking code. Or to rephrase, I don't see the advantage in moving from deterministic to nondeterministic code. I know that all of the edge cases in a promise chain can be handled, but I have yet to see it done well in deployed code. Which makes me think that it's probably untenable for the mainstream.
So I'd vote to add generators to the Rust spec in order to make coroutines possible, before I'd add futures/promises and async/await. But maybe they are all equivalent, so if we have one, we can make all of them, not sure.
In my experience, the fact that they are stackless is not at all obvious when you're coding with them. Rust makes working with them really simple and intuitive.
For example, lisp-based languages like Clojure can be statically analyzed and then parallelized so that all noninteracting code runs in its own process. This can also be done for code that operates on vectors like MATLAB and TensorFlow.
For me, isolated processes under the Actor model in languages like Elixer/Erlang and Go is much simpler conceptually than async/await, which is only one step above promises/futures, which is only one step above callback hell. I know that the web world uses async/await for now, but someday I think that will be replaced with something that works more like Go.
No, at least not in general. There are a lot of problems in the real world for which "no shared memory" is incompatible with "efficient parallelism".
For example, the copy on write (COW) mechanism of unix where memory pages are mapped to the same location until a forked process writes to one, in which case the virtual memory manager makes a mutable copy.
There's also Redux and Closure's immutable state tree that makes copies under a similar mechanism to COW but through code, since they run at a level of abstraction above C++ or Rust.
My feeling is that these techniques run within a few percent of the speed of hand-optimization. But in the real world, I've seen very little human code remain optimized over the long term. Someone invariably comes along who doesn't understand the principles behind the code and inadvertently does a manual copy somewhere or breaks the O() speed of the algorithm by using the wrong abstractions. Meanwhile immutable code using mechanisms like COW avoids these pitfalls because the code is small and obvious.
I feel that the things that Rust is trying to do were solved long ago under FP, so I don't think it's the language for me. That's also why I moved away from C#, Java, C++, etc. Better languages might be Elixer or Clojure/ClojureScript, although they still have ugly syntax from a mainstream perspective compared to say Javascript or Python. I love that Rust exists as a formal spec of a mature imperative programming language. I think it's still useful in kernels and the embedded space. But I'm concerned that it's borrowing ideas like async/await that trade determinism for performance.
This seems interesting. Do you have any pointers to places/papers I can look more into this? I'm also curious, since the stacks have to be rather small when you are running several thousands of coroutines (like Go), how often people get into issues of running out of stack because of some big stack allocation somewhere and stuff like that.
Ok it looks like current techniques are stackless runtimes and compiling coroutines to stackless continuations:
https://en.wikipedia.org/wiki/Stackless_Python
https://stackless.readthedocs.io/en/v3.6.4-slp/library/stack...
http://jessenoller.com/blog/2009/02/23/stackless-you-got-you...
https://engagedscholarship.csuohio.edu/cgi/viewcontent.cgi?r...
https://pdfs.semanticscholar.org/b9aa/49e4b7a00e6c9f0d8c18ba...
https://www.osnews.com/story/9822/protothreads-extremely-lig...
This looks like a rare gem, although I just started reading it:
https://cs.indiana.edu/~dfried/dfried/mex.pdf
I grew up with the cooperative multithreading of classic Mac OS and was really shocked when I first saw Javascript back in the 90s and it had no notion of it (because it didn't have generators). That sent us down the callback hell evolutionary dead end, through promises/futures and finally to async/await where we are now. That could have been largely avoided if we had listened to programming language experts!
Async/await is a great feature for those who require performance beyond what threads (backed by either 1:1 or M:N implementations) can provide. One of the major reasons behind the superior performance of async/await futures relative to threads/goroutines/etc. is that async/await compiles to a state machine in the manner described in this post, so a stack is not needed.
However in multithreaded languages it's always an AND. Once you add async/await people need to know how traditional threading as well as how async/await works. Rust will also make that very explicit. Even if you use async/await on a single thread the compiler will prevent accessing data from more than one task - even it those are running on the same thread. So you need synchronization again. With maybe the upside that this could be non-thread-safe (or !Sync in Rust terms) in order to synchronize between tasks that run on the same thread. But however also with the downside that utilizing the wrong synchronization primitives (e.g. thread-blocking ones) in async code can stall also other tasks.
Overall I think using async/await in Rust is strictly more complex. But it has its advantages in terms of performance, and being able to cancel arbitrary async operations.
You with application level concurrency you get 100x performance boost for network servers.
M:N was experimentally found to be slower than 1:1 in Rust.
Last time I saw there was still a couple extra allocations going on too in the compiler (I was told they were being worked on) and basically the default executor, tokio, wasn't very efficient at changing events in the queue (requiring a an extra self signal to get the job done).
I'd be interesting to see how little cost these are, because there is defintely a cost to the generator implementation. Yes, if I wrong a generator to do this, I couldn't write it better, but I wouldn't write a generator (and that would be a very odd definition of zero-cost there anything can be called zero cost even GC as long as it is implemented well - well, that depends on if rustc saves unnecessary state).
> Additionally, it should allow miri to validate the unsafe code inside our generators, checking for uses of unsafe that result in undefined behavior.
This is really useful as a lot of the buffer handing code needs to use unsafe for efficienty issues. And the enums sharing values is nice too - hopefully the extra pointer derferences can be optimized out.
I do worry though about all this state sitting on the head and ruining cache locality on function calls though.
It's pretty clear that most people don't want to write and maintain that kind of code, though—that's what async/await is for. Personally, I hate writing state machines.
Could you expand on that?
1. Go can start stacks small and relocate them, because the runtime needed to implement garbage collection allows relocation of pointers into the stack by rewriting them. Rust has no such runtime infrastructure: it cannot tell what stack values correspond to pointers and which correspond to other values. Additionally, Rust allows for pointers from the heap into the stack in certain circumstances, which Go does not (I don't think, anyway). So what Rust must do is to reserve a relatively large amount of address space for each thread's stack, because those stacks cannot move in memory. (Note that in the case of Rust the kernel does not have to, and typically does not, actually allocate that much physical memory for each thread until actually needed; the reservation is virtual address space only.) In contrast, Go can start thread stacks small, which makes them faster to allocate, and copy them around as they grow bigger. Note that async/await in Rust has the potential to be more efficient than even Go's stack growth, as the runtime can allocate all the needed space up front in some cases and avoid the copies; this is the consequence of async/await compiling to a static state machine instead of the dynamic control flow that threads have.
2. Rust cares more about fast interoperability with C code that may not be compiled with knowledge of async I/O and stack growth. Go chooses fast M:N threading over this, sacrificing fast FFI in the process as every cgo call needs to switch to a big stack. This is just a tradeoff. Given that 1:1 threading is quite fast and scalable on Linux, it's the right tradeoff for Rust's domain, as losing M:N threading isn't losing that much anyway.
That's a great optimization but doesn't that mean it also breaks stack traces?
The C++ proposal definitely attacks the problem from a different angle than Rust. One somewhat surface-level difference is that it implements co_yield in terms of co_await, which is the opposite of Rust implementing await in terms of yield.
Another difference is that in Rust, all heap allocations of your generators/futures are explicit. In C++, technically every initialization of a sub-coroutine starts defaults to being a new heap allocation. I don't want to spread FUD: my understanding is that the vast majority of these are optimized out by the compiler. But one downside of this approach is that you could change your code and accidentally disable one of these optimizations.
In Rust, all the "state inlining" is explicitly done as part of the language. This means that in cases where you can't inline state, you must introduce an explicit indirection. (Imagine, say, a recursive generator - it's impossible to inline inside of itself! When you recurse, you must allocate the new generator on the heap, inside a Box.)
To be clear, the optimizations I'm talking about in the blog post are all implemented today. I'll be covering what they do and don't do, as well as future work needed, in future blog posts.
One benefit of C++ that you allude to is that there are a lot of extension points. I admit to not fully understanding what each one of them is for, but my feeling is that some of it comes from approaching the problem differently. Some of it absolutely represents missing features in Rust's initial implementation. But as I say in the post, we can and will add more features on a rolling basis.
The way I would approach the specific problem you mention is with a custom executor. When you write the executor, you control how new tasks are scheduled, and can add an API that allows specifying a task priority. You can also allow modifying this priority within the task: when you poll a task, set a thread-local variable to point to that task. Then inside the task, you can gain a reference to yourself and modify your priority.
I don't think this is correct. C++ 20 allows a lot of choices to implement it without forcing a heap allocation.
see https://lewissbaker.github.io/2017/11/17/understanding-opera... also see this video that goes in depth how to have billions of coroutines with C++: https://www.youtube.com/watch?v=j9tlJAqMV7U
On your last paragraph, the thing I'm concerned by is where this extra priority information is stored and propogated, as the term "task" is interesting: isn't every single separate thing being awaited its own task? There isn't (in my mental model) a concept that maps into something like a "pseudo-thread" (but maybe Rust does something like this, requiring a very structured form of concurrency?), which would let me set a "pseudo-thread" property, right?
As an example: if I am already in an asynchronous coroutine and I spawn of two asynchronous web requests as sub-tasks, the results of which will be processed potentially in parallel on various work queues, and then join those two tasks into a high-level join task that I wait on (so I want both of these things to be done before I continue), I'd want the background processing done on the results to be handled at the priority of this parent spawning task; do I have to manually propagate this information?
In C++2a, I would model this by having a promise type that is used for my prioritize-able tasks and, to interface with existing APIs (such as the web request API) that are doing I/O scheduling; I'd use await_transform to adapt their promise type into one of mine that lets me maintain my deadline across the I/O operation and then recover it in both of the subtasks that come back into my multi-threaded work queue. Everything I've seen about Rust seems to assume that there is a single task/promise type that comes from the standard library, meaning that it isn't clear to me how I could possibly do this kind of advanced scheduling work.
(Essentially, whether or not it was named for this reason--and I'm kind of assuming it wasn't, which is sad, because not enough people understand monads and I feel like it hurts a lot of mainstream programming languages... I might even say particularly Rust, which could use more monadic concepts in its error handling--await_transform is acting as a limited form of monad transformer, allowing me to take different concepts of scheduled execution and merge them together in a way that is almost entirely transparent to the code spawning sub-tasks. The co_await syntax is then acting as a somewhat-lame-but-workable-I-guess substitute for do notation from Haskell. In a perfect world, of course, this would be almost as transparent as exceptions are, which are themselves another interesting form of monad.)
A C++ coroutine chooses a particular promise type as part of its definition. Its frame defaults to a separate heap-allocation per coroutine, with some allowance for elision. At a suspension point, it passes a type-erased handle to the current coroutine to an `await_suspend` method, which can either `return false` or call `handle.resume()` to resume the coroutine. A stack of `co_await`ing coroutines (or "psuedo-thread" as you call it) is thus a linked list of `coroutine_handle`s stored in the coroutine frames of their await-ees, rooted with whoever is responsible for next resuming the coroutine.
A Rust async function does things inside out, in a sense. It has no promise type; calling one directly returns its frame into the caller's frame, as a value with an anonymous type that implements the `Future` trait. This trait has a single method called `poll`, which resumes the function and runs it until its next suspension point. `poll` takes a single argument, a handle which is used to signal when it is ready to continue. This handle is threaded down through a stack of `poll`s (a "task" or pseudo-thread), and stored with whoever is responsible for notifying the task it should continue.
One implication of the Rust approach is that the "executor" and the "reactor" are decoupled. An executor maintains a collection of running tasks and schedules them. A reactor holds onto those handles and notifies executors of relevant events. This lets you control scheduling without language hooks like await_transform- you can associate your prioritization data with a task when you spawn it on a particular executor, and it can await any reactor without losing that information.
Another implication is that you have a choice of whether to a) `await` a future, making it part of the current task, or b) spawn it as its own task, to be scheduled on its own, much like OS thread APIs. Option (a) can get really interesting with multiple concurrent sub-futures (with things like Promise.all or select); it can be as simple as having the caller poll all its children every time it wakes up, or as complex as wrapping `poll`'s handle argument and implementing your own scheduling within a task.
Rusts design compared to that was not only focused on async functions as the main design goal, but also on maintaining compatibility with a "struct Future" type which also can be implemented by hand and represents a state-machine.
The latter will allow Rust to reuse lots of async infrastructure that had been built in the (combinators & state-machine) Futures world in the last years (e.g. the tokio library and everything on top of it).
One downside of Rusts approach might be that some parts feel a bit awkward and hard, e.g. the 2 different states of Futures (one where it hasn't been executed and can be moved around and one where it has been started executing and can't be moved anymore) and the pinning system. As far as I understand C++ exposes less of those details to end-users - this might be something where the implicit allocations might have helped it.
As far as I understand the async function flavor of C++ coroutines also have run-to-completion semantics and can't be cancelled at any yield point like Rusts Futures can be. This has the advantage of being able to wrap IO completion based operations in a more natural fashion than Rust. But it then again has the downside that users need to pass CancellationTokens around for cancellation, and that some code might not be cancellable.
All of the data needed by a task is contained within its future. That means we can neatly sidestep problems of dynamic stack growth and stack swapping, giving us truly lightweight tasks without any runtime system implications. ... Perhaps surprisingly, the future within a task compiles down to a state machine, so that every time the task wakes up to continue polling, it continues execution from the current state—working just like hand-rolled code. [1]
[1] https://aturon.github.io/blog/2016/09/07/futures-design/
How are those tasks implemented, and what's scheduling them?
In Rust, you can "inline" an entire chain of futures into a single heap allocation.
1) which was the first language that introduced the async/await keywords? I want to say C#, but I’m not sure.
2) are there any papers that describe the code transformation to state machine that is commonly performed to implement these keywords?
I was quickly searching around and found this paper [1]:
"Pause 'n' play: formalizing asynchronous C#".
It looks promising, although it is behind a paywall ;-(. Also, the keyword "formalizing" tells me that maybe this goes a bit deeper than the kind of description I'm looking for...
It's nice they're making Futures a zero-cost abstraction, but it feels like it is at the expense of ergonomics.
For myself, I'd been waiting for the syntax to finalize as I'm still learning and didn't want to really delve into old and new ways, though I'm sure I will see them in practice. For others, depending on your needs, if you didn't need it, or willing to jump through the hoops, you could still get value along the way.
Internally, when the code hits a yield, it's happening inside the resume method. yield works by saving the current state of the generator in the object (see e.g. resume_from in the post), and returning from resume().
|| {
yield 2;
yield 3;
yield 5;
}
will get converted to a struct that implements the Generator trait[1] with a resume method something like fn resume(self: Pin<&mut Self>) -> GeneratorState<i32, ()> {
match self.next_state {
0 => {
self.next_state = 1;
Yielded(2)
},
1 => {
self.next_state = 2;
Yielded(3)
},
2 => {
self.next_state = 3;
Yielded(5)
},
_ => Complete(())
}
}
Local variables become fields in the struct and if you have more complex control flow, the state could end up jumping around instead of just increasing by one each time. It's nothing that you couldn't write by hand, but it would be very tedious to do so.[1] https://doc.rust-lang.org/beta/unstable-book/language-featur...
But there's plenty of reason to want generators, including the fact that they let you build streams. And the fact that async/await relies heavily on them has pushed the implementation much closer to being ready. I hope we get them at some point!
I think "at some point" is the best answer :)
Good reading list at the bottom of https://areweasyncyet.rs, starting with this post that uses generators as an example: https://boats.gitlab.io/blog/post/2018-01-25-async-i-self-re...