
amelius 7y ago

> The reason is that two threads of the same program will often end up executing similar instruction streams

Why is that bad?

CHY872 7y ago

Your processor has a certain number of execution units which actually execute individual instructions: maybe 4 floating point units, maybe 8 arithmetic ones, and maybe 1 that can do vector processing (these numbers are not real, but they're good enough to make the point).

So the idea with SMT is that most of the time, lots of the execution units are unused because the thread a) isn't using them at all (e.g. a process doing encryption won't use the floating point units) and/or b) can't use them all because of how the program's written (for example, if I say 'load a random memory address, then add it to a register, then load another random memory address, then add it, etc.', I'm going to be spending most of my time waiting for memory to be loaded).

SMT basically means that you run another program at the same time, so even if the encryption process can't use the floating point units, maybe there's another process that we can schedule that will.

However, imagine my encryption process can use 6 of the 8 arithmetic units. If I have 2 encryption processes scheduled on the same core, I have demand for 12 when there are only 8. So now I have contention for resources, and I won't see much of a speedup from using SMT.
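To make that arithmetic concrete, here's a toy Python sketch. It assumes the made-up unit counts from above (8 arithmetic, 4 floating point), and `busy_units` and the `fp_job` workload are hypothetical names for illustration, not a model of any real scheduler:

```python
# Toy model: per unit class, the core can keep at most `capacity` units busy per cycle.
# All numbers are the made-up ones from this comment, not a real CPU.
CAPACITY = {"arith": 8, "fp": 4}

def busy_units(capacity, threads):
    """Units kept busy per cycle when `threads` share one core."""
    return {
        kind: min(cap, sum(t.get(kind, 0) for t in threads))
        for kind, cap in capacity.items()
    }

crypto = {"arith": 6}            # wants 6 arithmetic units, no floating point
fp_job = {"arith": 2, "fp": 4}   # hypothetical FP-heavy neighbour

print(busy_units(CAPACITY, [crypto]))          # {'arith': 6, 'fp': 0}
print(busy_units(CAPACITY, [crypto, crypto]))  # {'arith': 8, 'fp': 0}
print(busy_units(CAPACITY, [crypto, fp_job]))  # {'arith': 8, 'fp': 4}
```

Two encryption threads get 8 units of work per cycle between them instead of the 12 they want, while encryption plus the FP job keeps all 12 units busy.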

Other comments mention registers and not execution units: I'm suspicious of this, since modern processors have many physical registers (for Skylake, 250+) which they rename aggressively as part of out-of-order execution. Maybe this is different for the SIMD units.

That said, I haven't looked at this stuff since university, so I could well be wrong on the execution unit vs register comparison.

blattimwind 7y ago

The contention would actually not be on a per-EU level, but one level higher up. The reservation station has a bunch (~5-8) of ports and typically multiple EUs are connected to one port. Can't use one port for two different things at the same time.
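A rough Python sketch of that port-level contention. The port layout below is invented for illustration (it is NOT the real Skylake mapping), and the greedy in-order issue in the hypothetical `issue_one_cycle` helper is a simplifying assumption; a real scheduler picks from a window of ready uops:

```python
# Hypothetical layout: which execution units hang off which port.
# This is NOT the real Skylake mapping, just enough to show the effect.
PORTS = {
    0: {"alu", "fp_mul"},
    1: {"alu", "fp_add"},
    2: {"load"},
}

def issue_one_cycle(uops):
    """Greedily issue uops in order, at most one per port; return (issued, stalled)."""
    free = set(PORTS)
    issued, stalled = [], []
    for uop in uops:
        # First free port that has an execution unit of the right kind.
        port = next((p for p in sorted(free) if uop in PORTS[p]), None)
        if port is None:
            stalled.append(uop)
        else:
            free.remove(port)
            issued.append((uop, port))
    return issued, stalled

print(issue_one_cycle(["fp_mul", "alu", "alu"]))
# ([('fp_mul', 0), ('alu', 1)], ['alu'])
```

The second `alu` uop stalls even though the ALU behind port 0 is idle, because port 0 is already taken by the `fp_mul` this cycle.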

Here's a simplified block diagram of a Skylake core: https://en.wikichip.org/wiki/intel/microarchitectures/skylak...

CHY872 7y ago

Thanks! Yeah figured I'd be wrong somewhere in there!


gpderetta 7y ago

You are pretty much right. A couple of things to add: even hand-optimized asm code won't be able to use all ports all the time with a single instruction stream; the biggest win for hyperthreading is filling the pipeline bubbles caused by memory loads that miss L1 (there is only so much that OoO scheduling can do on your average load).
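A toy Python model of the bubble-filling effect, under heavy assumptions: one issue slot per cycle, every instruction is followed by a load that stalls its thread for a fixed `stall_cycles`, and no OoO overlap at all. `core_utilization` is a hypothetical helper, not a measurement of real hardware:

```python
def core_utilization(n_threads, stall_cycles, total_cycles=1000):
    """Fraction of cycles in which the single issue slot does useful work."""
    ready_at = [0] * n_threads  # cycle at which each thread can issue again
    busy = 0
    for cycle in range(total_cycles):
        for t in range(n_threads):
            if ready_at[t] <= cycle:
                busy += 1
                # The thread issues now, then waits for its load to come back.
                ready_at[t] = cycle + 1 + stall_cycles
                break  # only one issue slot per cycle in this model
    return busy / total_cycles

print(core_utilization(1, 3))  # 0.25
print(core_utilization(2, 3))  # 0.5
```

With one thread the slot sits idle three cycles out of four; a second hardware thread fills one of those bubbles, which is where the gain comes from in this simplified picture.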

amelius (OP) 7y ago

So on a RISC architecture, this would happen even for non-similar programs, because the number of instructions is smaller? Or would they just duplicate the processing units?

Dylan16807 7y ago

The units aren't dedicated to esoteric instructions; you have functions like "small alu", "big alu", "fp adder", "address generator and memory load".

RISC will perform about the same and you can hyperthread one fine.

jfoutz 7y ago

Different instruction streams use different registers. It’s like sharing a bathroom. I can shower while you brush your teeth. There’s more contention when we’re both trying to shower.

uryga 7y ago

"both are using vector instructions (these registers are shared between the two hyperthreads)"

So I'm guessing GP meant there's going to be contention for those registers, and thus no speedup?

xeroaura 7y ago

Taking a guess, but since they are running similar streams, they have similar loads at a specific time. Competition between main thread and hyperthread could hurt performance instead?
