Some of these old optimizations are now deprecated. For example, there’s a famous trick for inverse square root:
https://en.wikipedia.org/wiki/Fast_inverse_square_root
Modern processors have a dedicated instruction for that. The hardware instruction is several times faster and a couple of orders of magnitude more precise:
https://www.felixcloutier.com/x86/rsqrtps
Other optimizations are now applied automatically by compilers. For example, all modern compilers optimize integer division by compile-time constants; here’s an example: https://godbolt.org/z/1b8r5c5MG
Squeezing performance out of modern hardware requires doing very different things.
Here’s an example from numerical computing. On paper, each core of my CPU can do 64 single-precision FLOPs per cycle. In reality, to get anywhere near that number a program needs to spam _mm256_fmadd_ps instructions while issuing at most one AVX vector load per FMA, and at most one AVX vector store per two FMAs.