undefined | Better HN

0 pointsashvardanian1y ago0 comments

That may have been my mistake. I use super & hyper interchangeably and don't always notice :)

PS: Should be an easy patch, will update!

0 comments

Maybe not.

Superscalar is when say... Think of the following assembly code.

   Add r1, r2
   Sub r3, r4

And the add and subtract both happen on the same clock tick. The important thing is that a modern CPU core (and even GPU core) have multiple parallel ALU pipelines inside of them.

Because r1, r2, r3 and r4 are fully independent, a modern CPU can detect the potential parallelism here and act in parallel. After CPUs mastered this trick, the next out of order processors were invented (which not only allowed for super scalar operations, but allowed the subtract to execute first if for some reason the CPU core were waiting on r1 or r2).

There are a ton of ways that modern CPUs and GPUs extract parallelism from seemingly nothingness. And because all the techniques are independent, we can have superscalar out-of-order SIMD (like what happens in AVX512 in practice). SIMD is... SIMD. It's one instruction applied to lots of data in parallel. It's totally different.

You really need to use the correct word for the specific kind of parallelism that you are trying to highlight. I expect that the only word that makes sense in this article is SIMD.

bjourne1y ago

Agree with this. Calling SIMD superscalar is a misnomer since it is single instruction (multiple data) with very wide data paths. Superscalar implies multiple different instructions in parallel, such as adding a pair of numbers, while subtracting another pair (or even dividing).

pyrolistical1y ago

I wish hardware exposed an api that allowed us to submit a tree of instructions so the hardware doesn’t need figure out which instructions are independent.

Lots of this kind of work can be done during compilation but cannot be communicated to hardware due to code being linear

dragontamer1y ago

That's called VLIW and Intel Itanium is considered one of the biggest chip failures of all time.

There is an argument that today's compilers are finally good enough for VLIW to go mainstream, but good luck convincing anyone in today's market to go for it.

------

A big problem with VLIW is that it's impossible to predict L1, L2, L3 or DRAM access. Meaning all loads/stores are impossible to schedule by the compiler.

NVidia has interesting barriers that get compiled into its SASS (a level lower than PTX assembly). These barriers seem to allow the compiler to assist in the dependency management process but ultimately still require a decoder in the NVidia core final level before execution.

2 more replies

j / k navigate · click thread line to collapse

0 comments

dragontamer1y ago

Maybe not.

Superscalar is when say... Think of the following assembly code.

   Add r1, r2
   Sub r3, r4

And the add and subtract both happen on the same clock tick. The important thing is that a modern CPU core (and even GPU core) have multiple parallel ALU pipelines inside of them.

You really need to use the correct word for the specific kind of parallelism that you are trying to highlight. I expect that the only word that makes sense in this article is SIMD.

bjourne1y ago

pyrolistical1y ago

I wish hardware exposed an api that allowed us to submit a tree of instructions so the hardware doesn’t need figure out which instructions are independent.

Lots of this kind of work can be done during compilation but cannot be communicated to hardware due to code being linear

dragontamer1y ago

That's called VLIW and Intel Itanium is considered one of the biggest chip failures of all time.

There is an argument that today's compilers are finally good enough for VLIW to go mainstream, but good luck convincing anyone in today's market to go for it.

------

A big problem with VLIW is that it's impossible to predict L1, L2, L3 or DRAM access. Meaning all loads/stores are impossible to schedule by the compiler.

2 more replies

j / k navigate · click thread line to collapse