If a language:
- maps cleanly onto what compilers know how to optimize best
- gives you fine control over stack and heap allocation
- has an intrinsics / assembly escape hatch
- lets you specify that pointers don't alias (restrict in C, or the default in Fortran)
- gives you prefetching primitives
then you will be able to reach hand-tuned assembly-like performance (and not just C-like performance).
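To make two of those bullets concrete, here is a minimal C sketch of the no-alias and prefetching points. It assumes GCC or Clang for `__builtin_prefetch`; the function name `saxpy` and the `PREFETCH_AHEAD` distance are picked purely for illustration, and a real distance would have to be tuned per machine.

```c
#include <stddef.h>

/* y[i] += a * x[i].
   `restrict` promises the compiler that x and y never overlap, so it can
   vectorize the loop without emitting runtime alias checks. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y) {
    /* Assumed values for illustration only: 16 floats = one 64-byte cache line,
       and we hint four lines ahead of the current position. */
    enum { LINE = 16, PREFETCH_AHEAD = 64 };

    for (size_t i = 0; i < n; ++i) {
#if defined(__GNUC__) || defined(__clang__)
        /* Issue one hint per cache line, and only for addresses inside the arrays. */
        if (i % LINE == 0 && i + PREFETCH_AHEAD < n) {
            __builtin_prefetch(&x[i + PREFETCH_AHEAD], 0, 3); /* will be read */
            __builtin_prefetch(&y[i + PREFETCH_AHEAD], 1, 3); /* will be written */
        }
#endif
        y[i] += a * x[i];
    }
}
```

On a simple streaming loop like this the hardware prefetcher usually does fine on its own; the point is only that the language exposes the knob for the access patterns where it doesn't.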
Case in point: I tuned my own matrix multiplication algorithms in Nim, carefully controlling register allocation, L1, L2 and L3 cache usage, and vector intrinsics, to reach the speed of the assembly-tuned OpenBLAS and Intel MKL-DNN (with no assembly at all):
bench: https://github.com/numforge/laser/blob/c7ddceb0/benchmarks/g...
code: https://github.com/numforge/laser/tree/c7ddceb0/laser/primit...
And matrix multiplication has decades of research behind it and now has dedicated hardware (tensor cores, EPU, TPU, NPU, ...), since it is a key algorithm for most numerical workloads.
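For readers who haven't seen what "controlling register allocation" looks like in practice, here is a rough C sketch of the register-blocking idea. This is not the code from the linked repository: the names `micro_kernel` and `matmul_tiled` and the 4x4 tile size are made up for illustration, and it assumes row-major matrices whose dimensions are multiples of the tile size.

```c
#include <stddef.h>

enum { MR = 4, NR = 4 }; /* assumed register-tile size, for illustration only */

/* C tile (MR x NR) += A panel (MR x k) * B panel (k x NR).
   All row-major; lda, ldb, ldc are the row strides of the full matrices. */
static void micro_kernel(size_t k,
                         const float *restrict a, size_t lda,
                         const float *restrict b, size_t ldb,
                         float *restrict c, size_t ldc) {
    /* Small fixed-size accumulator that the compiler can keep in registers. */
    float acc[MR][NR] = {{0}};

    for (size_t p = 0; p < k; ++p)
        for (size_t i = 0; i < MR; ++i)
            for (size_t j = 0; j < NR; ++j)
                acc[i][j] += a[i * lda + p] * b[p * ldb + j];

    /* Write the tile back to memory once, after the whole k loop. */
    for (size_t i = 0; i < MR; ++i)
        for (size_t j = 0; j < NR; ++j)
            c[i * ldc + j] += acc[i][j];
}

/* C (m x n) += A (m x k) * B (k x n); m and n must be multiples of MR and NR. */
void matmul_tiled(size_t m, size_t n, size_t k,
                  const float *restrict a,
                  const float *restrict b,
                  float *restrict c) {
    for (size_t i = 0; i < m; i += MR)
        for (size_t j = 0; j < n; j += NR)
            micro_kernel(k, &a[i * k], k, &b[j], n, &c[i * n + j], n);
}
```

Tuned kernels typically go further: they pack panels of A and B into contiguous buffers sized for each cache level and write the inner loop with vector intrinsics, but the shape of the loop nest is the same.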