And the last generation was wider and deeper than the one before it, which also cost power and area.
The question that should be asked ... but which would never be answered ... is "What was it that you changed that REQUIRED and ALLOWED you to go wider and deeper?"
It's not a new process node every time.
There's no NEED for a massive reorder buffer unless you can decode and dispatch that number of instructions in the time it takes for a load to arrive from whichever level of the memory hierarchy you're optimising for. And there's no POINT if you're often going to hit a misprediction within that number of instructions. Ok, so wider decode is one component of that. Is there a difference in memory latency as well?

Wider decode past 3 or 4 instructions increasingly means you can't just end your packet of decoded instructions at the first branch -- as you get wider you're increasingly going to have to both parse past a conditional branch and predict more than one branch in the same decode cycle. You'll also run into branches that jump to other instructions in the same decode group (either forward or backward).
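The sizing argument above can be put into back-of-envelope arithmetic: by Little's law, covering a load latency needs roughly (dispatch width × latency) instructions in flight, and the expected run of instructions between mispredictions caps how big a window is useful. The numbers below are purely illustrative assumptions, not figures for any particular core:

```python
def rob_entries_to_cover(dispatch_width, latency_cycles):
    """Instructions in flight needed to hide latency_cycles of load latency
    at a sustained dispatch_width (Little's law)."""
    return dispatch_width * latency_cycles

def expected_run_before_flush(branch_frequency, mispredict_rate):
    """Expected instructions retired between branch mispredictions."""
    return 1.0 / (branch_frequency * mispredict_rate)

# Illustrative, assumed parameters:
width = 8            # instructions dispatched per cycle
l3_latency = 40      # cycles to whichever cache level you're hiding
branch_freq = 0.2    # ~1 branch per 5 instructions
mispredict = 0.01    # 99% predictor accuracy

print(rob_entries_to_cover(width, l3_latency))             # → 320 entries
print(expected_run_before_flush(branch_freq, mispredict))  # ≈ 500 instructions
```

With these made-up numbers the two limits are in the same ballpark, which is the point of the comment: widening dispatch or deepening the ROB only pays off if predictor accuracy (and decode's ability to handle multiple branches per cycle) improves in step.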
There are all kinds of complications there, with no doubt interesting solutions, that go far beyond "we went wider and deeper".