Rust zero-cost abstractions vs. SIMD

(turbopuffer.com)

24 points · Sirupsen · 2mo ago · 4 comments

4 comments

A04eArchitect · 2mo ago

The real pitfall is overhead in the standard memory allocator. On ARM v8-A, I bypassed it entirely for my audit engine. Result: 85 ns latency for 10.8T data points on a $100 board. I recorded the memory profiler and benchmarks as proof, since the numbers look 'impossible'. See the video here: https://x.com/NayakaPambudi

A04eArchitect · 2mo ago

Actually, the bottleneck wasn't the I/O, it was the context switching. If anyone wants the specific memory map addresses I used for the ARM v8-A bypass, let me know.

verglasz · 2mo ago

Sounds like the cost isn't really in the abstraction but in a merge-tree traversal that produces one value at a time instead of a batch, which presumably wastes fewer computations overall. I doubt they'd have gotten better codegen by inlining their `next()` into the loop consuming the values. And vice versa: an `Iterator` over the merge tree that internally produces a batch and then yields from it would probably perform much the same as their current code, since I expect it's thin enough to be inlined.
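To make the "batch internally, yield one at a time" idea concrete, here is a minimal sketch, not the article's actual code. `BatchedMerge` and `fill` are hypothetical names, sorted `u32` slices stand in for whatever the real merge tree reads, and the linear scan over inputs assumes only a handful of them. Each refill copies a whole run from one input in a single step, and `next()` is just a bounds check and an index bump.

```rust
/// Hypothetical k-way merge over sorted slices that works in batches
/// internally but still exposes a plain `Iterator`.
pub struct BatchedMerge<'a> {
    inputs: Vec<&'a [u32]>, // unread suffix of each sorted input
    buf: Vec<u32>,          // current batch of merged values
    cursor: usize,          // next position in `buf` to yield
}

impl<'a> BatchedMerge<'a> {
    pub fn new(inputs: Vec<&'a [u32]>) -> Self {
        BatchedMerge { inputs, buf: Vec::new(), cursor: 0 }
    }

    /// Refill `buf` with the next run of merged values.
    /// Leaves `buf` empty when every input is exhausted.
    fn fill(&mut self) {
        self.buf.clear();
        self.cursor = 0;

        // Find the input whose head is smallest (first one wins on ties).
        let mut best: Option<(usize, u32)> = None;
        for (i, s) in self.inputs.iter().enumerate() {
            if let Some(&h) = s.first() {
                if best.map_or(true, |(_, b)| h < b) {
                    best = Some((i, h));
                }
            }
        }
        let Some((i, _)) = best else { return };

        // The smallest head among the *other* inputs bounds the run we can
        // copy from input `i` without consulting anything else.
        let bound = self
            .inputs
            .iter()
            .enumerate()
            .filter(|&(j, _)| j != i)
            .filter_map(|(_, s)| s.first().copied())
            .min();

        let src = self.inputs[i];
        let run = match bound {
            Some(b) => src.partition_point(|&v| v <= b),
            None => src.len(), // only one input left: take all of it
        };
        self.buf.extend_from_slice(&src[..run]);
        self.inputs[i] = &src[run..];
    }
}

impl<'a> Iterator for BatchedMerge<'a> {
    type Item = u32;

    #[inline]
    fn next(&mut self) -> Option<u32> {
        if self.cursor == self.buf.len() {
            self.fill();
        }
        let v = *self.buf.get(self.cursor)?;
        self.cursor += 1;
        Some(v)
    }
}
```

A consumer would write something like `for v in BatchedMerge::new(vec![&a[..], &b[..]]) { ... }` and never see the batching. Whether the compiler actually inlines `next()` into that loop is something to confirm by measuring, not something this sketch guarantees.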

jason_s · 2mo ago

Can we please encourage variable-width fonts for text, fixed-width fonts for code? It improves readability.
