ABSTRACT
Large neural networks spend most computation on floating point tensor multiplications. In this work, we find that a floating point multiplier can be approximated by one integer adder with high precision. We propose the linear-complexity multiplication (L-Mul) algorithm that approximates floating point number multiplication with integer addition operations. Compared to 8-bit floating point multiplications, the proposed method achieves higher precision while consuming significantly less bit-level computation. Since multiplying floating point numbers requires substantially higher energy than integer addition operations, applying the L-Mul operation in tensor processing hardware can potentially reduce the energy cost of elementwise floating point tensor multiplications by 95% and of dot products by 80%. We calculated the theoretical error expectation of L-Mul, and evaluated the algorithm on a wide range of textual, visual, and symbolic tasks, including natural language understanding, structural reasoning, mathematics, and commonsense question answering. Our numerical analysis experiments agree with the theoretical error estimation, which indicates that L-Mul with a 4-bit mantissa achieves precision comparable to float8 e4m3 multiplications, and L-Mul with a 3-bit mantissa outperforms float8 e5m2. Evaluation results on popular benchmarks show that directly applying L-Mul to the attention mechanism is almost lossless. We further show that replacing all floating point multiplications with 3-bit mantissa L-Mul in a transformer model achieves precision equivalent to using float8 e4m3 as the accumulation precision in both fine-tuning and inference.
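As a rough sketch of the idea in the abstract (my own illustration, not the paper's reference implementation; it omits the mantissa offset term the paper describes, if I read it right, and does not handle zeros, subnormals, NaNs, or infinities), an approximate float32 product can be formed with a single integer addition on the bit patterns:

#include <stdint.h>
#include <string.h>

/* Approximate a float32 product with one integer addition on the IEEE 754 bit
 * patterns. The sign is handled separately; everything else is one add. */
static float approx_mul(float a, float b) {
    uint32_t ua, ub;
    memcpy(&ua, &a, sizeof ua);
    memcpy(&ub, &b, sizeof ub);
    uint32_t sign = (ua ^ ub) & 0x80000000u;  /* sign of the product */
    /* Sum the exponent+mantissa fields as one integer and subtract the
     * exponent bias (127 << 23) once so it is not counted twice. */
    uint32_t ur = (ua & 0x7FFFFFFFu) + (ub & 0x7FFFFFFFu) - 0x3F800000u;
    ur |= sign;
    float r;
    memcpy(&r, &ur, sizeof r);
    return r;
}

Because exponent and mantissa are summed as one integer, a mantissa overflow simply carries into the exponent field, which is what lets the single addition stand in for an approximate multiply.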
Presumably there will be a lot of interest.
You still need the GPU parallelism though.
If you're training models with billions of parameters, you're still gonna need that.
Here: https://news.ycombinator.com/item?id=41784591 but even before that. It is possibly one of those ideas that are obvious to people steeped in this.
To me it intuitively seems wasteful to use floats to make ultimately boolean-like decisions, but that seemed like the way it had to be in order to have differentiable algorithms.
I tried implementing this for AVX512 with tinyBLAS in llamafile.
#include <immintrin.h>

inline __m512 lmul512(__m512 x, __m512 y) {
    // Field masks and bias for IEEE 754 single precision.
    __m512i sign_mask = _mm512_set1_epi32(0x80000000);
    __m512i exp_mask  = _mm512_set1_epi32(0x7F800000);
    __m512i mant_mask = _mm512_set1_epi32(0x007FFFFF);
    __m512i exp_bias  = _mm512_set1_epi32(127);
    // Reinterpret the float lanes as raw bit patterns.
    __m512i x_bits = _mm512_castps_si512(x);
    __m512i y_bits = _mm512_castps_si512(y);
    // Split out sign, biased exponent, and mantissa fields.
    __m512i sign_x = _mm512_and_si512(x_bits, sign_mask);
    __m512i sign_y = _mm512_and_si512(y_bits, sign_mask);
    __m512i exp_x  = _mm512_srli_epi32(_mm512_and_si512(x_bits, exp_mask), 23);
    __m512i exp_y  = _mm512_srli_epi32(_mm512_and_si512(y_bits, exp_mask), 23);
    __m512i mant_x = _mm512_and_si512(x_bits, mant_mask);
    __m512i mant_y = _mm512_and_si512(y_bits, mant_mask);
    // Sign of the product is the XOR of the signs.
    __m512i sign_result = _mm512_xor_si512(sign_x, sign_y);
    // Exponents add; subtract the bias once so it is not double counted.
    __m512i exp_result = _mm512_sub_epi32(_mm512_add_epi32(exp_x, exp_y), exp_bias);
    // Average the mantissas: add them, then shift right by one.
    __m512i mant_result = _mm512_srli_epi32(_mm512_add_epi32(mant_x, mant_y), 1);
    // Reassemble the fields into a float bit pattern.
    __m512i result_bits = _mm512_or_si512(
        _mm512_or_si512(sign_result, _mm512_slli_epi32(exp_result, 23)), mant_result);
    return _mm512_castsi512_ps(result_bits);
}
Then I used it for Llama-3.2-3B-Instruct.F16.gguf and it outputted gibberish. So you would probably have to train and design your model specifically to use this multiplication approximation in order for it to work. Or maybe I'd have to tune the model so that only certain layers and/or operations use the approximation. However the speed was decent. Prefill only dropped from 850 tokens per second to 200 tok/sec on my Threadripper. Prediction speed was totally unaffected, staying at 34 tok/sec. I like how the code above generates vpternlog ops. So if anyone ever designs an LLM architecture and releases weights on Hugging Face that use this algorithm, we'll be able to run them reasonably fast without special hardware.

Another example: if we have, say, 1.9 * 1.9, then we need to account for overflow in (0.9 + 0.9), and this seems to induce similar overhead to expressing numbers as (1 - 0.05) * 2.
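A quick worked version of that 1.9 * 1.9 case (my own arithmetic, assuming the sign-stripped bit patterns are summed as one integer with the bias subtracted once, as in the scalar sketch above): 1.9 is 2^0 * 1.9, so each stored fraction is 0.9. The fraction sum 0.9 + 0.9 = 1.8 overflows the mantissa field; the carry bumps the exponent from 0 to 1 and leaves a stored fraction of 0.8, so the result reads as 2^1 * 1.8 = 3.6, against the exact 3.61. In a whole-bit-pattern addition the carry handles that overflow automatically; whether that is cheaper than the explicit handling described above depends on how the numbers are represented.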
The 95% figure applies specifically to the multiplication operations; inference is compute-light and memory-heavy in the first place, so the actual end-to-end gains would be far smaller.
Tech journalism (all journalism, really) can hardly be trusted to publish grounded news, given the focus on clicks and revenue outlets need to survive.
Right now the only way to gain real knowledge is actually to read the comments on those articles.
We have a winner. Glad that came from someone not in my lectures on ML network design
Honestly, thanks for beating me to this comment
If I could have a SWAG at it, I would say a low-resolution model like llama-2 would probably be just fine (llama-2 quantizes without too much headache), but a higher-resolution model like llama-3 probably not so much, not without massive retraining anyway.
Also,
> Additionally, it reduces energy consumption by 55.4% to 70.0%
With humility, I don't know what that means. It seems like some dubious math with percentages.
I'm not claiming it's impossible, nor am I claiming that it isn't true or, at least, honest.
But there will need to be evidence that, using real machines and real energy, _equivalent performance_ is achievable. A defense that "there are no suitable chips" is a bit disingenuous. If the 95% savings actually has legs, some smart chip manufacturer will do the math and make the chips. If it's correct, that chip-making firm will make a fortune. If it's not, they won't.
Terrible logic. By similar logic we wouldn't be using Python for machine learning at all, for example (or x86 for compute). Yet here we are.
> In this section, we show that L-Mul is more precise than fp8 e4m3 multiplications
> To be concise, we do not consider the rounding to nearest even mode in both error analysis and complexity estimation for both Mul and L-Mul
These two statements together are nonsensical. Sure, if you analyze accuracy while ignoring the part of the algorithm that gives the baseline its accuracy, you can derive whatever cherry-picked result you want.
The multiplication of two floating point values, if you round to nearest even, will be the correctly rounded result of multiplying the original values at infinite precision; this is how floating point rounding usually works and what IEEE 754 mandates for fundamental operations (including the multiplication here) if you choose to follow those guidelines. Not rounding to nearest even, however, results in a lot more quantization noise, and biased noise at that (a quick numerical illustration of the bias follows at the end of this comment).
> applying the L-Mul operation in tensor processing hardware can potentially reduce 95% energy cost by elementwise floating point tensor multiplications and 80% energy cost of dot products
A good chunk of the energy cost is simply moving data between memories (especially external DRAM/HBM/whatever) and along wires, buffering values in SRAMs and flip-flops and the like. Combinational logic cost is usually not a big deal. While having a ton of fixed-function matrix multipliers does raise the cost of combinational logic quite a bit, at most what they have will probably cut the power of an overall accelerator by 10-20% or so.
> In this section, we demonstrate that L-Mul can replace tensor multiplications in the attention mechanism without any loss of performance, whereas using fp8 multiplications for the same purpose degrades inference accuracy
I may have missed it in the paper, but they have provided no details on (re)scaling and/or using higher precision accumulation for intermediate results as one would experience on an H100 for instance. Without this information, I don't trust these evaluation results either.
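Picking up the rounding-bias point above, here is a small standalone sketch (my own illustration, not from the paper; the function name, the choice of keeping 3 mantissa bits, and the use of rand() are arbitrary) that quantizes float32 mantissas by plain truncation versus round-to-nearest-even and prints the mean signed error of each:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Drop the low (23 - keep) mantissa bits of a float32, either by truncation
 * (round toward zero) or by round-to-nearest-even. */
static float quantize_mantissa(float x, int keep, int rne) {
    uint32_t u;
    memcpy(&u, &x, sizeof u);
    int drop = 23 - keep;
    uint32_t low  = u & ((1u << drop) - 1);   /* bits being discarded */
    uint32_t half = 1u << (drop - 1);         /* half of one kept ULP */
    u &= ~((1u << drop) - 1);                 /* truncate             */
    if (rne && (low > half || (low == half && (u & (1u << drop)))))
        u += 1u << drop;                      /* round up; ties go to even */
    memcpy(&x, &u, sizeof x);
    return x;
}

int main(void) {
    double err_trunc = 0.0, err_rne = 0.0;
    int n = 1000000;
    srand(0);
    for (int i = 0; i < n; i++) {
        float x = 1.0f + (float)rand() / (float)RAND_MAX;   /* samples in [1, 2] */
        err_trunc += quantize_mantissa(x, 3, 0) - x;
        err_rne   += quantize_mantissa(x, 3, 1) - x;
    }
    printf("mean signed error: truncation %g, round-to-nearest-even %g\n",
           err_trunc / n, err_rne / n);
    return 0;
}

Truncation should come out with a consistently negative mean error (roughly half a kept ULP), while round-to-nearest-even should average out near zero.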
So it turns the (1+a)(1+b) into 1+a+b. Which is definitely not the same! But it turns out, machine guessing apparently doesn't care much about the difference.
Am I missing something?
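For reference, the exact expansion is (1+a)(1+b) = 1 + a + b + ab, so the approximation drops the cross term ab. With mantissa fractions a, b in [0, 1) the dropped term can approach 1, i.e. a relative error of up to roughly 25% in the worst case (a and b both near 1, where 1 + a + b of about 3 stands in for a true product of about 4), and the error is always in the same direction: too small. If I'm reading the paper correctly, that one-sided bias is why L-Mul adds a small constant offset to the mantissa sum instead of using the plain 1 + a + b.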
Feels like multiplication shouldn't be needed for convergence, just monotonicity? I wonder how well it would perform if the model was actually trained the same way.
You're going to have tolerance on the result anyway, so what's a little more error. :)
https://news.ycombinator.com/item?id=41816598
This has been done for decades in digital circuits, FPGAs, digital signal processing, etc. Floating point is both resource- and power-intensive, and absent dedicated FP processing hardware it is something that has been avoided and done without for decades unless absolutely necessary.
Their rediscovery of fixed point was bad enough but the “omg if we represent poses as quaternions everything works better” makes any game engine dev for the last 30 years explode.
At a basic level it is very simple: a 10-bit bus gives you the ability to represent numbers between 0 and 1 with a resolution of approximately 0.001; 12 bits would be four times better. Integer circuits can do the math in one clock cycle. Hardware multipliers do the same. To rescale the numbers after multiplication you just take the N high bits, where N is your bus width, which is a zero-clock-cycle operation. Etc.
In training a neural network, the back propagation math can be implemented using almost the same logic used for a polyphase FIR filter.
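As a concrete sketch of that multiply-and-rescale step (my own illustration with made-up names, using a Q1.15 format rather than the 10-bit bus above): multiply in a wider integer, then take the high bits to get back to the original width.

#include <stdint.h>

/* Q1.15 fixed point: value = raw / 2^15, range roughly [-1, 1). */
typedef int16_t q15_t;

static q15_t  q15_from_double(double v) { return (q15_t)(v * 32768.0); }
static double q15_to_double(q15_t v)    { return v / 32768.0; }

/* Multiply two Q1.15 numbers: widen to 32 bits, multiply, then keep the high
 * bits by shifting right 15, i.e. the "take the N high bits" rescale. */
static q15_t q15_mul(q15_t a, q15_t b) {
    return (q15_t)(((int32_t)a * (int32_t)b) >> 15);
}

For example, q15_mul(q15_from_double(0.5), q15_from_double(0.25)) yields the raw value 4096, which q15_to_double reads back as 0.125.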
Why not publish some actual benchmarks that prove your claim in even a few special cases?
And, two, because the actual energy cost savings claimed aren't even the experimental question: the energy cost differences between the various operations on modern hardware have been established in other research. The experimental issue here was whether the mathematical technique that enables using the lower-energy operations performs competitively on output quality with existing implementations when substituted in for LLM inference.
Should be able to go more efficient than that, as the brain has other constraints, such as working at 36.7 degrees C, etc.
"The first release of bitnet.cpp is to support inference on CPUs. bitnet.cpp achieves speedups of 1.37x to 5.07x on ARM CPUs, with larger models experiencing greater performance gains. Additionally, it reduces energy consumption by 55.4% to 70.0%, further boosting overall efficiency. On x86 CPUs, speedups range from 2.37x to 6.17x with energy reductions between 71.9% to 82.2%. Furthermore, bitnet.cpp can run a 100B BitNet b1.58 model on a single CPU, achieving speeds comparable to human reading (5-7 tokens per second), significantly enhancing the potential for running LLMs on local devices. More details will be provided soon."
Even on sites that have a "Like / Don't like" button, my understanding is that clicking "Don't like" is a form of "engagement" that the recommendation algorithms are going to reward.
Give me a button that says "this article was a scam", and have the publisher give the advertisement money back. Or better yet, give the advertisement money to charity / public services / whatever.
Take a cut of the money being transferred; charge the publishers for being able to get a "clickbait free" green mark if they implement the scheme.
Track the kinds of articles that generate the most clickbait-angry comments. Sell the data back.
There might be a business model there.
Reader: do the moral thing and read the article, not just the title
How is that balanced?
Obviously, energy cost creates a barrier to entry, so reduction of cost reduces the barrier to entry... which adds more players... which increases demand.
In other sensationalized words: "AI engineers can claim new algorithm allows them to fit GPT-5 in an RTX5090 running at 600 watts."
I had assumed the latency etc were based on what was desirable for the use case and hardware, rather than power consumption.
Maybe even generate a table of the approximate results and use that, in various stages? Like the way sin/cos was done 30 years ago, before FP coprocessors arrived.
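In that spirit, a minimal sketch of the table approach (my own illustration; the table size and names are arbitrary, and the old implementations were typically fixed-point rather than float):

#include <math.h>

/* Table-based sine: precompute N samples over one period, then index into the
 * table with linear interpolation instead of evaluating sin() each time. */
#define LUT_SIZE 256
static float sin_lut[LUT_SIZE];

static void sin_lut_init(void) {
    for (int i = 0; i < LUT_SIZE; i++)
        sin_lut[i] = (float)sin(2.0 * 3.141592653589793 * i / LUT_SIZE);
}

/* phase is in [0, 1), representing one full period. */
static float sin_approx(float phase) {
    float pos  = phase * LUT_SIZE;
    int   i    = (int)pos;
    float frac = pos - (float)i;
    float a = sin_lut[i % LUT_SIZE];
    float b = sin_lut[(i + 1) % LUT_SIZE];
    return a + frac * (b - a);   /* linear interpolation between samples */
}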
For the sake of the climate and the environment, it would be nice if it were true.
Bad news for Nvidia. “Sell your stock” bad.
Does it come with a demonstration?
If this increases integer demand and decreases floating point demand, that moderately changes future product design and doesn't do much else.
People say this but then the fastest and most-used implementation of these optimizations is always written in CUDA. If this turns out to not be a hoax, I wouldn't be surprised to see Nvidia prices jump in correlation.
Surely even the OpenAI devs must have done this like the minute they got done training that model, right? I wonder if they'd even admit it was an AI that came up with the solution rather than just publishing it, and taking credit. haha.
https://hachyderm.io/@inthehands/112006855076082650
> You might be surprised to learn that I actually think LLMs have the potential to be not only fun but genuinely useful. “Show me some bullshit that would be typical in this context” can be a genuinely helpful question to have answered, in code and in natural language — for brainstorming, for seeing common conventions in an unfamiliar context, for having something crappy to react to.
> Alas, that does not remotely resemble how people are pitching this technology.