Now, the other approach, is that you have a CPU with out of order (OoO) execution. Meaning that the CPU contains a scheduler that handles a queue of instructions, and any instruction that has all its dependencies satisfied can be submitted for execution. And then later on a bunch of magic happens so that externally to the CPU it still looks like everything was executed in order like the program code specified. This is pretty good for getting good single thread performance, and can exploit some amount of MLP as well, e.g. if a bunch of instructions are waiting for a memory operation to complete, some other instructions can still proceed (perhaps executing a memory op themselves). So in this model the amount of MLP is limited by the inherent serial dependencies in the code, and on the length of the instruction queues that the scheduler maintains. The downside of this is that the OoO logic takes up quite a bit of chip area (making it more expensive), and also tends to be one of the more power-hungry parts of the chip. But, if you want good single-thread performance, that's the price you have to pay.. Anyway, now that you have this OoO CPU, what about adding hardware threads? Well, now that you already have all this scheduling logic, turns out it's relatively easy. Just "tag" each instruction with a thread ID, and let the scheduler sort it all out. So this is what is called Simultaneous Multi-Threading (SMT). So in a way it's a pretty different way of doing threading compared to the Niagara-style in-order processor. Also, since you already have all this OoO logic that is able to exploit some MLP within each thread, you don't need as many threads as the Niagara-style CPU to saturate the memory subsystem. So, this SMT style of threading is what you see in contemporary Intel x86 processors (they call it hyperthreading (HT)), IBM POWER, and now also AMD Zen cores.
As for benchmarks, I'm too lazy to search, but I'm sure you can find e.g. some speccpu results for Niagara.
So although separated by time but not by clocks (the intel setup has the roughly the same base clocks and the same ram as the t4 setup) the 40 thread Xeon system had roughly double the perf of the 128 thread t4 setup running speccjvm2008 https://www.spec.org/jvm2008/results/jvm2008.html