undefined | Better HN

0 pointstom_6mo ago0 comments

Why do they need so many threads? This really feels like they just designed the cpu poorly, in that it can't extract enough parallelism out of the instruction stream already.

(Intel and AMD stopped at 2! Apparently more wasn't worth it for them. Presumably because the cpu was doing enough of the right thing already.)

0 comments

ckozlowski6mo ago

As I recall it, Intel brought about Hyperthreading on Northwood and later Pentium 4s as a way to help with issues in it's long pipeline. As I remember it described at the time, P4 had 30+ stages in it's pipeline. Many of them did not need to be used in a given thread. Furthermore, if a branch prediction engine guessed wrong, then the pipeline needed to be cleared and started anew. For a 30+ stage pipeline, that's a lot of wasted clock cycles.

So hyper-threading was a way to recoup some of those losses. I recall reading at the time that it was a "latency hiding technique". How effective it was I leave to others. But it became standard it seems on all x86 processors in time. Core and Core 2 didn't seem to need it (much shorter pipelines) but later Intel and AMD processors got it.

This is how it was explained to me at the time anyways. I was working at an OEM from '02-'05, and I recall when this feature came out. I pulled out my copy of "Inside the Machine" by Jon Stokes which goes deep into the P4 architecture, but strangely I can only find a single mention of hyperthreading in the book. But it goes far into the P4 architecture and why branch misses are so punishing. It's a good read.

Edit: Adding that I suspect instruction pipelines are not so long that adding additional threads would help. I suspect diminishing returns past 2.

justsomehnguy6mo ago

> As I recall it, Intel brought about Hyperthreading on Northwood and later Pentium 4s as a way to help with issues in it's long pipeline.

Well, Intel brought Hyperthreading to Xeon first and they were quite slow, so the additional thread performance were quite welcome there.

But the GHz race was lead to the monstruosity of 3.06GHz CPUs where the improvement in speed didn't quite translated to the improvement in performance. And while the Northwood fared well (especially considering the disaster of Willamette) GHz/performance wise, the Prescott wasn't and mostly showed the same performance in non-SSE/cache bound tasks[1], so Intel needed to push the GHz further which required a longer pipeline and brought even more penalty on a prediction miss.

Well, at least this is how I remember it.

[0] https://en.wikipedia.org/wiki/List_of_Intel_Xeon_processors_...

[1] but excelled in the room heating, people joked what they even didn't bother with an apartment heating in winter, just leaving a computer running

bcrl6mo ago

The P4's main problem wasn't so much the 30 stage pipeline. It was the fact that the CPU went off into la-la land for 4000+ cycles when replays occurred. The trigger could be anything as simple as a misaligned load or microcode for an 8 byte rep ; movs or a page fault. As a result, the performance of the design was very fragile and too sensitive to extremely small changes in code and data layout, in large part because of how the features to support high clock speeds all fit together. Performance tuning could often get substantial improvements for complex code paths, but the optimizations were frequently useless for other microarchitectures that didn't have the same glass jaws.

Hyperthreading was much less of a concern given that threading of software was only ramping up for mainstream x86.

bee_rider6mo ago

Any time somebody mentions the Pentium 4, it feels like a peek at a time-line we didn’t end up going down. Imagine if Intel had stuck to their guns, maybe they could have pushed through and we’d have CPUs with ridiculous 90 stage pipelines, and like 4 threads per core. Maybe frameworks, languages, and programmer experience would have conspired to help write programs with threads that work together very closely, taking advantage of the shared cache of the hyperthreads.

I mean, it obviously didn’t happen, but it is fun to wonder about.

TristanBall6mo ago

I suspect part of it is licensing games, both in the sense of "avoiding per core license limits" which absolutely matters when your DB is costing a million bucks, and also in the 'enable the highest PVU score per chassis' for ibm's own license farming.

Power systems tend not to be under the same budget constraints as intel, whether thats money, power, heat, whatever, so the cost benifit of adding more sub-core processing for incremental gains is likely different too.

I may have a raft of issues with IBM, and aix, but those Power chips are top notch.

hinkley6mo ago

Yeah that was another thing. You run Oracle you gotta turn that shit off in the BIOS otherwise you're getting charged 2x for 20% more performance.

wmf6mo ago

AFAIK Oracle does not charge extra for SMT.

twoodfin6mo ago

Low-latency databases are architected to be memory-bandwidth bound. SMT allows more connections to be generating more loads faster, utilizing more memory bandwidth.

Think async or green threads, but for memory or branch misses rather than blocking I/O.

(As mentioned elsewhere, optimizing for vendor licensing practices is a nice side benefit, but obviously if the vendors want $X for Y compute on their database, they’ll charge that somehow.)

wmf6mo ago

Power does have higher memory latency because of OMI and it supports more sockets. But I think the main motivation for SMT8 is low-IPC spaghetti code.

j / k navigate · click thread line to collapse

0 comments

ckozlowski6mo ago

Edit: Adding that I suspect instruction pipelines are not so long that adding additional threads would help. I suspect diminishing returns past 2.

justsomehnguy6mo ago

> As I recall it, Intel brought about Hyperthreading on Northwood and later Pentium 4s as a way to help with issues in it's long pipeline.

Well, Intel brought Hyperthreading to Xeon first and they were quite slow, so the additional thread performance were quite welcome there.

Well, at least this is how I remember it.

[0] https://en.wikipedia.org/wiki/List_of_Intel_Xeon_processors_...

[1] but excelled in the room heating, people joked what they even didn't bother with an apartment heating in winter, just leaving a computer running

bcrl6mo ago

Hyperthreading was much less of a concern given that threading of software was only ramping up for mainstream x86.

bee_rider6mo ago

I mean, it obviously didn’t happen, but it is fun to wonder about.

TristanBall6mo ago

I may have a raft of issues with IBM, and aix, but those Power chips are top notch.

hinkley6mo ago

Yeah that was another thing. You run Oracle you gotta turn that shit off in the BIOS otherwise you're getting charged 2x for 20% more performance.

wmf6mo ago

AFAIK Oracle does not charge extra for SMT.

twoodfin6mo ago

Low-latency databases are architected to be memory-bandwidth bound. SMT allows more connections to be generating more loads faster, utilizing more memory bandwidth.

Think async or green threads, but for memory or branch misses rather than blocking I/O.

(As mentioned elsewhere, optimizing for vendor licensing practices is a nice side benefit, but obviously if the vendors want $X for Y compute on their database, they’ll charge that somehow.)

wmf6mo ago

Power does have higher memory latency because of OMI and it supports more sockets. But I think the main motivation for SMT8 is low-IPC spaghetti code.

j / k navigate · click thread line to collapse