First of all, you're only benchmarking the time it takes for fork(2) to return in the parent subshell, nothing else. The new processes may not have run a single instruction at this point, and certainly haven't exec'd (which tends to be why you're forking).
Second, you're not measuring the cost at all. The forked children will, at some point, start executing on other CPUs, where they still have to finish their setup and run exec, and all of that takes time. The real cost is the total number of cycles spent before the child is executing the intended code.
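To make the difference concrete, here is a rough sketch (not the benchmark from the post) that times both the return of fork in the parent and the full fork + exec + wait cycle. The function name is made up for illustration, and it assumes a `true` binary is on PATH. Since the second number also includes the child running and exiting, treat it as an upper bound on "time until the child is executing the intended code".

```python
import os
import time

def time_fork_return_and_total(argv=("true",)):
    """Return (fork_return_s, total_s) for one fork + exec + wait cycle.

    Hypothetical helper; assumes argv[0] can be found on PATH.
    """
    start = time.perf_counter()
    pid = os.fork()
    if pid == 0:
        # Child: replace the process image with the target program.
        try:
            os.execvp(argv[0], list(argv))
        except OSError:
            pass
        # Only reached if exec failed; leave without running parent cleanup.
        os._exit(127)
    # Parent: fork has returned, but the child may not have run yet.
    fork_returned = time.perf_counter()
    os.waitpid(pid, 0)  # blocks until the child has exec'd, run, and exited
    done = time.perf_counter()
    return fork_returned - start, done - start

if __name__ == "__main__":
    fork_s, total_s = time_fork_return_and_total()
    print(f"fork returned after {fork_s * 1e6:.0f} us, "
          f"child finished after {total_s * 1e6:.0f} us")
```

The first number is all the original benchmark captures; the gap between the two is the work it leaves out.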
Fork is damn expensive, but whether it's too expensive depends on the use case and on the cost of adding hardware.
Fork time scales with the amount of virtual memory the forking process has mapped, and you're forking from a fresh subshell that hardly has anything allocated. It's even mentioned in the linked post that their issue stemmed from this (specifically fork lock contention spiking as fork time increased).
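If you want to see the memory effect yourself, something like the following would do it. This is only a sketch under my own assumptions: the helper name is invented, it uses a touched bytearray as a stand-in for a process with a large heap, and it still only measures how long fork takes to return in the parent.

```python
import mmap
import os
import time

def fork_return_time(resident_bytes):
    """Time os.fork() in the parent with roughly resident_bytes of touched heap.

    Hypothetical helper for illustration only.
    """
    buf = bytearray(resident_bytes)
    # Write one byte per page so the pages are actually mapped,
    # not just reserved lazily by the allocator.
    for offset in range(0, resident_bytes, mmap.PAGESIZE):
        buf[offset] = 1
    start = time.perf_counter()
    pid = os.fork()
    if pid == 0:
        # Child: exit immediately without running interpreter cleanup.
        os._exit(0)
    elapsed = time.perf_counter() - start
    os.waitpid(pid, 0)  # reap the child so it does not linger as a zombie
    return elapsed
```

Call it with a few sizes (say 0, 64 MiB and 256 MiB) and compare the results; a single run is noisy, so repeat each size several times before drawing conclusions.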