The real pitfall is the overhead of the standard memory allocator. On ARMv8-A, I bypassed it entirely for my audit engine.
Result: 85 ns latency for 10.8T data points on a $100 board.
I recorded the memory profiler and benchmarks as proof, since the numbers look 'impossible'. See the video here:
https://x.com/NayakaPambudi
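
For readers wondering what "bypassing the standard allocator" can look like in practice, here is a minimal sketch of one common approach: a bump (arena) allocator over a buffer reserved up front. This is an assumption for illustration, not necessarily what the audit engine does. All names (Arena, arena_init, arena_alloc, arena_reset) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bump allocator: every name here is illustrative only. */
typedef struct {
    uint8_t *base;  /* start of the caller-provided buffer */
    size_t   cap;   /* total bytes in the buffer */
    size_t   used;  /* bytes handed out so far, including padding */
} Arena;

/* Attach the arena to a buffer the caller reserved once, up front. */
static void arena_init(Arena *a, void *buf, size_t cap) {
    a->base = (uint8_t *)buf;
    a->cap  = cap;
    a->used = 0;
}

/* Return `size` bytes aligned to `align` (a power of two), or NULL if
 * the arena cannot fit the request. No locks, no free lists, no syscalls. */
static void *arena_alloc(Arena *a, size_t size, size_t align) {
    assert(align != 0 && (align & (align - 1)) == 0);

    uintptr_t cur     = (uintptr_t)a->base + a->used;
    uintptr_t aligned = (cur + (align - 1)) & ~(uintptr_t)(align - 1);
    size_t    pad     = (size_t)(aligned - cur);
    size_t    left    = a->cap - a->used;

    /* Check padding and size separately to avoid overflow in the sum. */
    if (pad > left || size > left - pad) {
        return NULL;
    }
    a->used += pad + size;
    return (void *)aligned;
}

/* Release everything at once by rewinding the offset. */
static void arena_reset(Arena *a) {
    a->used = 0;
}
```

Each allocation is a few integer operations and one bounds check, so the hot path never touches malloc. The trade-off is that individual blocks cannot be freed; the whole arena is rewound with arena_reset, which fits batch-style workloads.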