Agreed. These kinds of optimizations can yield significant benefits and are often employed in language standard libraries. For example, the Scala standard library employs an analogous optimization in its Set[0] collection type.
0 - https://github.com/scala/scala3/blob/88438e2c6e6204e12666067...
Of course, use cases are limited (variable-length buffers/strings, etc.), since the lifetime of anything on the stack has to match the lifetime of the stack frame (i.e., the function that made the allocation), but it's super fast since it's just bumping the stack pointer.
The obvious issue is that you can't know how much space is left on the stack, so you basically have to guess and pick an arbitrary "safe" size limit. This gets even more tricky when functions may be called recursively.
The more subtle issue is that the stack memory returned by alloca() has function scope (it isn't released until the function returns), and therefore you must never call alloca() directly in a loop.
I use alloca() on a regular basis, but I have to say there are safer and better alternatives, depending on the particular use case: arena/frame allocators, threadlocal pseudo-stacks, static vectors, small vector optimizations, etc.
Oh, huh. I've never actually tried it, but I always assumed it would be possible to calculate this, at least for a given OS / arch. You just need 3 quantities, right? `remaining_stack_space = $system_stack_size - ($stack_address - $rsp)`, assuming a downward-growing stack.
But I guess there's no API for a program to get its own stack address unless it has access to `/proc/$pid/maps` or similar?
This is not a problem for Go, because it has resizable stacks.
>The alloca() function is machine- and compiler-dependent. Because it allocates from the stack, it's faster than malloc(3) and free(3). In certain cases, it can also simplify memory deallocation in applications that use longjmp(3) or siglongjmp(3). Otherwise, its use is discouraged.
Furthermore:
>The alloca() function returns a pointer to the beginning of the allocated space. If the allocation causes stack overflow, program behavior is undefined.
You can just as well pass a heap-allocated buffer + size around and allocate by incrementing the pointer and decrementing the remaining size.
Or even better, use something like Zig's FixedBufferAllocator.
Correct me if I am wrong, please.
I think what you're referring to is an arena allocator: you allocate a big chunk of memory from the heap, sequentially sub-allocate from it, then eventually free the entire heap chunk (the arena) in one go. Arena allocators are therefore also a special-case tool, since they're for when all the sub-allocations share the same (but arbitrary) lifetime, or at least you're willing to defer deallocation of everything to the same point in time.
So, heap, arena and stack allocation all serve different purposes, although you can just use heap for everything if memory allocation isn't a performance issue for your program, which nowadays is typically the case.
Back in the day when memory was scarce and computers were much slower, another common technique was to keep a reuse "free list" of allocated items of a given type/size, which was faster than heap allocate and free/coalesce, and avoided the heap fragmentation of random malloc/frees.
There's been at least one experiment (posted a few years ago to HN) where someone benchmarked a stackful coroutine implementation with hundreds of thousands (millions?) of stacks that could grow contiguously on demand up to, e.g., 2MB, but were initially minimally sized and didn't reserve the maximum stack size upfront. The bottleneck was the VMA bookkeeping -- the syscalls, exploding the page table, TLB flushing, etc. In principle it could work well and be even more performant than existing solutions, and it might work better today given Linux 6.13's lightweight guard-page feature, MADV_GUARD_INSTALL, but we probably still need more architectural support from the system (kernel, if not hardware) to make it performant and competitive with language-level solutions like goroutines, Rust async, etc.
But it's hard to see this as very useful unless we also start to see some increases in legibility, and ways to make sure these optimizations are being used (and that textually minor changes don't cause non-obvious performance regressions).
I've written a lot of golang code that was benchmarked to shreds, and in which we absolutely cared about stack-vs-heap allocations because they were crucial to overall performance. I've spent a lot of time poring over assembler dumps, because grepping those for indications of new object creation was sometimes clearer (and certainly more definitive) than trying to infer it from the source code level. The one thing I've learned from this?
It's very, very easy for all those efforts to come to naught if the rules change slightly.
And it's very, very, VERY easy for a co-maintainer on a project to stroll in and make seemingly textually trivial changes that have outsized impacts. (I'm looking at inliner thresholds, specifically. Hoo boy.)
The best balm we have for this right now is writing benchmarks and making sure they report zero allocs. (Or unit tests using the runtime memstats harness; potato potato.) But that is a very fragile balm, and relatively complex to maintain, and (if DX is considered) is not textually local to the code in question -- which means someone changing the code can easily miss the criticality of a section (until the tests yell at them, at least).
I really yearn for some markup that can say "I expect this code to contain zero heap allocations; please flunk the compile if that is not the case".
Correct me if I'm wrong, but isn't this the worst-case scenario? realloc can, IIRC, extend the block in place. The old pointer is formally invalidated either way, but in that case no copy is needed.
Unless I'm missing something?
Equally, what happens to the ordering of variables on the stack? Is this new one pushed as the last one? Or is there space kept open?
E.g.:
var tasks []task
var other_var int

This is, definitionally, a small-object boxing optimization.