undefined | Better HN

0 pointsobservationist5d ago0 comments

I think what they're getting at is that for a given unit of compute, this method achieves 125% performance.

If model A reaches performance level 100 using 100 units of compute using old methods, and you train model B using AttnRes, aiming at performance level 100, it costs you 80 units of compute.

It probably doesn't map precisely, but that's where people are diverging from the claim - it doesn't explicitly say anything about reduced inference or training time, but that's the implicit value of these sorts of things. Less compute to equivalent performance can be a huge win for platforms at scale as well as for local models.

0 comments

dvt5d ago

> I think what they're getting at is that for a given unit of compute, this method achieves 125% performance.

This is not what they're getting at; I explained exactly what they're getting at. I mean, your equivalence of "loss" (what authors actually measured) and "performance" is just bizarre. We use benchmarks to measure performance, and the numbers there were like 1-5% better (apart from the GPQA-Diamond outlier).

Do people even read these papers?

jszymborski4d ago

> Do people even read these papers?

Overwhelmingly, no. You may have mistaken this for a lab's reading group, but most people here just skim the README, maybe read the abstract or figures. Expecting them to do more is uh... a bit strange?

But also you can forgive people for equating loss with performance, which are admittedly different but related ideas.

j / k navigate · click thread line to collapse