For those who still think manual memory management, and especially malloc and free, is fast, take a look at the glibc implementation:
https://fossies.org/dox/glibc-2.21/malloc_8c_source.html#l02...
Also see this benchmark comparing several faster allocators: http://locklessinc.com/benchmarks_allocator.shtml
The extra work doesn't end at allocation. When you implement a concurrent system, you usually end up with corner cases around object lifetime changes, which you need to synchronize. If memory is only reclaimed once there are no more references to it, this extra synchronization step can be avoided. If you can't rely on that, you'll probably end up doing the synchronization yourself, for example with (atomic) reference counting.
Synchronization is very expensive, and it can quickly become the performance bottleneck for the whole system. On modern x86, you can do 5-20k floating-point operations in the time of one contended atomic sync op. A reference-count increment or decrement is one sync op; a simple mutex needs two of them.
The more CPU cores you have, the more synchronization (cache-coherence) traffic gets broadcast to all of them.
Words like "JIT" and "GC" seem to cause knee-jerk reactions in some developers, and so does manual memory management. It's not black and white; there are always trade-offs. I usually write low-level (firmware and kernel-driver) and high-performance code in C/C++/SIMD, code that may need to react in under a microsecond.
My message is simply: please be more open-minded.
Analyze where your code actually spends its execution time. You might be surprised how much of it goes to things like C++ streams, xprintfs, and memory allocation. Unfortunately, inter-core synchronization is more insidious: it isn't visible in benchmarks on small systems. Often you only start to see hints of the problem when running on more cores; with enough of them, synchronization is all your code is doing.