You don't need a fat runtime to do fibers/stackful coroutines. You don't need any language support for that matter, just 50 lines of assembly to save registers on the stack and switch stack pointers. Minicoro [1] is a C library that implements fibers in a single header (just the creation/destruction/context switching, you have to bring your own scheduler).
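To make the "save registers and switch stack pointers" idea concrete, here is a minimal sketch using POSIX `ucontext` instead of hand-written assembly. It is not minicoro's API and not our engine's code; `fiber_t`, `fiber_init`, `fiber_resume`, `fiber_yield` and `fiber_demo_sum` are hypothetical names made up for this example. `swapcontext` also saves the signal mask, so expect it to be noticeably slower than a dedicated assembly switch.

```c
#include <stdlib.h>
#include <ucontext.h>

#define FIBER_STACK_SIZE (64 * 1024)

/* Hypothetical fiber type for this sketch (not minicoro's). */
typedef struct {
    ucontext_t caller;  /* saved context of whoever resumed the fiber */
    ucontext_t self;    /* saved context of the fiber itself */
    void *stack;        /* heap-allocated stack the fiber runs on */
    int yielded;        /* value handed out by the last yield */
    int done;           /* set once the fiber body has finished */
} fiber_t;

/* The fiber being run right now; a global keeps the sketch short. */
static fiber_t *g_current;

/* Suspend the running fiber and switch back to its caller. */
static void fiber_yield(int value) {
    fiber_t *f = g_current;
    f->yielded = value;
    swapcontext(&f->self, &f->caller);
}

/* Example body: yields 1, 2, 3 and then finishes. */
static void counter_body(void) {
    for (int i = 1; i <= 3; i++)
        fiber_yield(i);
    g_current->done = 1;
    /* Returning here switches to uc_link, i.e. the caller context. */
}

/* Give the fiber its own stack and an entry point. */
static int fiber_init(fiber_t *f, void (*body)(void)) {
    f->stack = malloc(FIBER_STACK_SIZE);
    if (f->stack == NULL)
        return -1;
    f->yielded = 0;
    f->done = 0;
    if (getcontext(&f->self) != 0) {
        free(f->stack);
        return -1;
    }
    f->self.uc_stack.ss_sp = f->stack;
    f->self.uc_stack.ss_size = FIBER_STACK_SIZE;
    f->self.uc_link = &f->caller;
    makecontext(&f->self, body, 0);
    return 0;
}

/* Switch from the caller into the fiber until it yields or finishes. */
static void fiber_resume(fiber_t *f) {
    g_current = f;
    swapcontext(&f->caller, &f->self);
}

/* Drive the fiber to completion and add up everything it yields. */
int fiber_demo_sum(void) {
    fiber_t f;
    int sum = 0;
    if (fiber_init(&f, counter_body) != 0)
        return -1;
    for (;;) {
        fiber_resume(&f);
        if (f.done)
            break;
        sum += f.yielded;
    }
    free(f.stack);
    return sum;
}
```

Calling `fiber_demo_sum()` from `main` resumes the fiber four times: three yields and one final run to completion. There is no scheduler here, the caller decides when to resume, which is the same split of responsibilities described above.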
Our game engine has an in-house implementation - creating a fiber, scheduling it, and waiting for it to complete takes ~300ns on my box. Creating an OS thread and join()ing it is about 1000x slower, ~300us.
[1] https://github.com/edubart/minicoro