Also you need compiler support to correctly handle thread_local.
[1] https://github.com/gpderetta/delimited/blob/master/delimited...
__attribute__((naked)) on a function whose whole body is a single asm block gives you control over argument passing and over changing the stack pointer.
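To make that concrete, here is a minimal sketch of such a naked switch function, assuming x86-64 System V and GCC or Clang. The names stack_switch, initial_sp and naked_demo are made up for illustration. It saves only the callee-saved integer registers and leaves the floating-point control state alone.

```c
#include <stddef.h>
#include <stdint.h>

#if !defined(__x86_64__)
#error "sketch assumes the x86-64 System V calling convention"
#endif

/* Hypothetical switch routine: save the callee-saved integer registers on
   the current stack, store the stack pointer through save_sp (rdi), load
   the new stack pointer from load_sp (rsi), then restore and return there.
   MXCSR and the x87 control word are deliberately not saved. */
__attribute__((naked)) void stack_switch(void **save_sp, void *load_sp)
{
    __asm__(
        "pushq %rbp\n\t"
        "pushq %rbx\n\t"
        "pushq %r12\n\t"
        "pushq %r13\n\t"
        "pushq %r14\n\t"
        "pushq %r15\n\t"
        "movq %rsp, (%rdi)\n\t"
        "movq %rsi, %rsp\n\t"
        "popq %r15\n\t"
        "popq %r14\n\t"
        "popq %r13\n\t"
        "popq %r12\n\t"
        "popq %rbx\n\t"
        "popq %rbp\n\t"
        "ret\n\t");
}

/* Build an initial frame so that the first switch "returns" into entry
   with the stack aligned as if entry had been called normally. */
void *initial_sp(void *base, size_t size, void (*entry)(void))
{
    uintptr_t top = ((uintptr_t)base + size) & ~(uintptr_t)15;
    void **sp = (void **)top;
    *--sp = NULL;                      /* fake return address: entry must not return */
    *--sp = (void *)(uintptr_t)entry;  /* consumed by the ret in stack_switch */
    for (int i = 0; i < 6; i++)
        *--sp = NULL;                  /* initial rbp, rbx, r12..r15 */
    return sp;
}

static void *main_sp, *fibre_sp;
static volatile int steps;             /* updated from the other stack */
static _Alignas(16) char fibre_stack[64 * 1024];

static void fibre_entry(void)
{
    for (;;) {
        steps++;
        stack_switch(&fibre_sp, main_sp);  /* yield back to the main stack */
    }
}

/* Enter the fibre twice and report how many times it ran. */
int naked_demo(void)
{
    fibre_sp = initial_sp(fibre_stack, sizeof fibre_stack, fibre_entry);
    stack_switch(&main_sp, fibre_sp);
    stack_switch(&main_sp, fibre_sp);
    return steps;
}
```

Because the switch is reached through an ordinary call, the caller has already given up anything it kept below its stack pointer, and the six push/pop pairs are exactly the registers the calling convention says a callee must preserve.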
__attribute__((preserve_none)) on the same function makes almost every register caller-saved, so the compiler spills whatever is live to the stack in the caller. The coroutine switch then doesn't need as many push/pop instructions, which makes it a bit more readable, but mainly this means you don't spill dead registers. That's the big thing you need compiler support for.
I believe the x64 red zone is a non-issue here because you've called the switch function, rather than trying to call from within inline asm (which does need to be careful about that). The magic globals are a problem though (floating-point control state, maybe the signal mask, errno et al.), so I guess don't use the magic globals from within fibres.
"thread_local" doesn't map very sensibly onto fibres. There have been compiler bugs in that area too. Storing some information at the start of the fibre stack works fine though, you just don't get syntactic support for allocating / dereferencing from it.
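A rough sketch of the "store it at the start of the fibre stack" idea, assuming each stack is a power-of-two-sized block aligned to its own size so the header can be found by masking any address inside it. struct fibre_local and the function names are hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>

/* Assumed stack size: must be a power of two for the masking trick. */
#define FIBRE_STACK_SIZE ((size_t)64 * 1024)

/* Hypothetical per-fibre data standing in for thread_local variables. */
struct fibre_local {
    int id;
    int last_error;   /* a fibre-private stand-in for errno */
};

/* Allocate a stack aligned to its own size and put the header at the
   lowest address. The stack grows down from the top of the block, so the
   header is the last thing an overflow would reach. */
void *fibre_stack_alloc(int id)
{
    void *base = aligned_alloc(FIBRE_STACK_SIZE, FIBRE_STACK_SIZE);
    if (base == NULL)
        return NULL;
    struct fibre_local *fl = base;
    fl->id = id;
    fl->last_error = 0;
    return base;
}

/* Initial stack pointer for the fibre: one past the end of the block. */
void *fibre_stack_top(void *base)
{
    return (char *)base + FIBRE_STACK_SIZE;
}

/* Recover the header from any address strictly inside the stack, for
   example the address of a local variable in the running fibre. */
struct fibre_local *fibre_local_from(const void *addr_in_stack)
{
    uintptr_t p = (uintptr_t)addr_in_stack;
    return (struct fibre_local *)(p & ~((uintptr_t)FIBRE_STACK_SIZE - 1));
}
```

Inside a fibre you would write something like `int here; fibre_local_from(&here)->last_error = 5;`, which is the missing syntactic support in practice: every access goes through an explicit lookup instead of a plain variable name.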
Ideally I think that a ctx_t* __builtin_context_switch(ctx_t* to) would need to be provided by the compiler.
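For the shape of that interface, here is a library-level stand-in built on the old POSIX ucontext calls, assuming glibc on Linux. No such builtin exists here: context_switch is a hypothetical name, and having it return the context that switched back to us is my assumption, since the signature alone doesn't say what the result means.

```c
#include <stddef.h>
#include <ucontext.h>

typedef struct ctx { ucontext_t uc; } ctx_t;

static ctx_t main_ctx, fibre_ctx;
static ctx_t *current = &main_ctx;   /* plain globals: single-threaded sketch */
static ctx_t *switched_from;
static char fibre_stack[64 * 1024];
static int steps;

/* Hypothetical stand-in for the proposed builtin: suspend the current
   context, resume `to`, and once something resumes us again, return the
   context that did so. */
static ctx_t *context_switch(ctx_t *to)
{
    ctx_t *self = current;
    switched_from = self;
    current = to;
    swapcontext(&self->uc, &to->uc);
    return switched_from;            /* set by whoever switched back to us */
}

static void fibre_entry(void)
{
    ctx_t *caller = switched_from;   /* on first entry: who started us */
    for (;;) {                       /* never returns, so uc_link is unused */
        steps++;
        caller = context_switch(caller);
    }
}

/* Resume the fibre twice and report how many times it ran. */
int ucontext_demo(void)
{
    getcontext(&fibre_ctx.uc);
    fibre_ctx.uc.uc_stack.ss_sp = fibre_stack;
    fibre_ctx.uc.uc_stack.ss_size = sizeof fibre_stack;
    fibre_ctx.uc.uc_link = NULL;
    makecontext(&fibre_ctx.uc, fibre_entry, 0);

    ctx_t *from = context_switch(&fibre_ctx);
    from = context_switch(from);
    return from == &fibre_ctx ? steps : -1;
}
```

swapcontext saves and restores the whole register set and the signal mask regardless of what is live at the call site. A compiler-provided builtin could instead save only what is actually live.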
Re thread_local, I believe at least MSVC has (had?) a fiber-safe flag (/GT) that would handle thread_locals correctly by not caching their addresses across function calls.