ABSTRACT
Large neural networks spend most computation on floating point tensor multiplications. In this work, we find that a floating point multiplier can be approximated by one integer adder with high precision. We propose the linear-complexity multiplication (L-Mul) algorithm that approximates floating point number multiplication with integer addition operations. Compared to 8-bit floating point multiplications, the proposed method achieves higher precision while consuming significantly less bit-level computation. Since multiplying floating point numbers requires substantially higher energy than integer addition operations, applying the L-Mul operation in tensor processing hardware can potentially reduce the energy cost of elementwise floating point tensor multiplications by 95% and of dot products by 80%. We calculated the theoretical error expectation of L-Mul, and evaluated the algorithm on a wide range of textual, visual, and symbolic tasks, including natural language understanding, structural reasoning, mathematics, and commonsense question answering. Our numerical analysis experiments agree with the theoretical error estimation, which indicates that L-Mul with a 4-bit mantissa achieves precision comparable to float8 e4m3 multiplications, and L-Mul with a 3-bit mantissa outperforms float8 e5m2. Evaluation results on popular benchmarks show that directly applying L-Mul to the attention mechanism is almost lossless. We further show that replacing all floating point multiplications with 3-bit mantissa L-Mul in a transformer model achieves precision equivalent to using float8 e4m3 as the accumulation precision in both fine-tuning and inference.
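As a rough sketch of the idea in the abstract (my own illustration, not the paper's reference implementation; it omits the mantissa offset term the paper describes, if I read it right, and does not handle zeros, subnormals, NaNs, or infinities), an approximate float32 product can be formed with a single integer addition on the bit patterns:

#include <stdint.h>
#include <string.h>

/* Approximate a float32 product with one integer addition on the IEEE 754 bit
 * patterns. The sign is handled separately; everything else is one add. */
static float approx_mul(float a, float b) {
    uint32_t ua, ub;
    memcpy(&ua, &a, sizeof ua);
    memcpy(&ub, &b, sizeof ub);
    uint32_t sign = (ua ^ ub) & 0x80000000u;  /* sign of the product */
    /* Sum the exponent+mantissa fields as one integer and subtract the
     * exponent bias (127 << 23) once so it is not counted twice. */
    uint32_t ur = (ua & 0x7FFFFFFFu) + (ub & 0x7FFFFFFFu) - 0x3F800000u;
    ur |= sign;
    float r;
    memcpy(&r, &ur, sizeof r);
    return r;
}

Because exponent and mantissa are summed as one integer, a mantissa overflow simply carries into the exponent field, which is what lets the single addition stand in for an approximate multiply.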
Presumably there will be a lot of interest.
You still need the GPU parallelism though.
If you're training models with billions of parameters, you're still gonna need that.
Here: https://news.ycombinator.com/item?id=41784591 but even before that. It is possibly one of those ideas that are obvious to people steeped in this.
To me it intuitively seems wasteful to use floats to make ultimately boolean-like decisions, but that seemed like the way it had to be in order to have differentiable algorithms.
I tried implementing this for AVX512 with tinyBLAS in llamafile.
#include <immintrin.h>

inline __m512 lmul512(__m512 x, __m512 y) {
    // Field masks and bias for IEEE 754 single precision.
    __m512i sign_mask = _mm512_set1_epi32(0x80000000);
    __m512i exp_mask  = _mm512_set1_epi32(0x7F800000);
    __m512i mant_mask = _mm512_set1_epi32(0x007FFFFF);
    __m512i exp_bias  = _mm512_set1_epi32(127);
    // Reinterpret the float lanes as raw bit patterns.
    __m512i x_bits = _mm512_castps_si512(x);
    __m512i y_bits = _mm512_castps_si512(y);
    // Split out sign, biased exponent, and mantissa fields.
    __m512i sign_x = _mm512_and_si512(x_bits, sign_mask);
    __m512i sign_y = _mm512_and_si512(y_bits, sign_mask);
    __m512i exp_x  = _mm512_srli_epi32(_mm512_and_si512(x_bits, exp_mask), 23);
    __m512i exp_y  = _mm512_srli_epi32(_mm512_and_si512(y_bits, exp_mask), 23);
    __m512i mant_x = _mm512_and_si512(x_bits, mant_mask);
    __m512i mant_y = _mm512_and_si512(y_bits, mant_mask);
    // Sign of the product is the XOR of the signs.
    __m512i sign_result = _mm512_xor_si512(sign_x, sign_y);
    // Exponents add; subtract the bias once so it is not double counted.
    __m512i exp_result = _mm512_sub_epi32(_mm512_add_epi32(exp_x, exp_y), exp_bias);
    // Average the mantissas: add them, then shift right by one.
    __m512i mant_result = _mm512_srli_epi32(_mm512_add_epi32(mant_x, mant_y), 1);
    // Reassemble the fields into a float bit pattern.
    __m512i result_bits = _mm512_or_si512(
        _mm512_or_si512(sign_result, _mm512_slli_epi32(exp_result, 23)), mant_result);
    return _mm512_castsi512_ps(result_bits);
}
Then I used it for Llama-3.2-3B-Instruct.F16.gguf and it outputted gibberish. So you would probably have to train and design your model specifically to use this multiplication approximation in order for it to work. Or maybe I'd have to tune the model so that only certain layers and/or operations use the approximation. However the speed was decent. Prefill only dropped from 850 tokens per second to 200 tok/sec on my Threadripper. Prediction speed was totally unaffected, staying at 34 tok/sec. I like how the code above generates vpternlog ops. So if anyone ever designs an LLM architecture and releases weights on Hugging Face that use this algorithm, we'll be able to run them reasonably fast without special hardware.

Another example: if we have, say, 1.9 * 1.9, then we need to account for overflow in (0.9 + 0.9), and this seems to induce similar overhead to expressing numbers as (1 - 0.05) * 2.
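A quick worked version of that 1.9 * 1.9 case (my own arithmetic, assuming the sign-stripped bit patterns are summed as one integer with the bias subtracted once, as in the scalar sketch above): 1.9 is 2^0 * 1.9, so each stored fraction is 0.9. The fraction sum 0.9 + 0.9 = 1.8 overflows the mantissa field; the carry bumps the exponent from 0 to 1 and leaves a stored fraction of 0.8, so the result reads as 2^1 * 1.8 = 3.6, against the exact 3.61. In a whole-bit-pattern addition the carry handles that overflow automatically; whether that is cheaper than the explicit handling described above depends on how the numbers are represented.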
The 95% figure applies specifically to the multiplication operations; inference is compute-light and memory-heavy in the first place, so the actual end-to-end gains would be far smaller.
Tech journalism (all journalism, really) can hardly be trusted to publish grounded news, given the focus on clicks and revenue outlets need to survive.
Right now the only way to gain real knowledge is actually to read the comments on those articles.
We have a winner. Glad that came from someone not in my lectures on ML network design
Honestly, thanks for beating me to this comment
If I could have a SWAG at it, I would say a low-resolution model like llama-2 would probably be just fine (llama-2 quantizes without too much headache), but a higher-resolution model like llama-3 probably not so much, not without massive retraining anyway.
Also,
> Additionally, it reduces energy consumption by 55.4% to 70.0%
With humility, I don't know what that means. It seems like some dubious math with percentages.
I'm not claiming it's impossible, nor am I claiming that it isn't true or, at least, honest.
But there will need to be evidence that, using real machines and real energy, _equivalent performance_ is achievable. A defense that "there are no suitable chips" is a bit disingenuous. If the 95% savings actually has legs, some smart chip manufacturer will do the math and make the chips. If it's correct, that chip-making firm will make a fortune. If it's not, they won't.
Terrible logic. By similar logic we wouldn't be using Python for machine learning at all, for example (or x86 for compute). Yet here we are.
> In this section, we show that L-Mul is more precise than fp8 e4m3 multiplications
> To be concise, we do not consider the rounding to nearest even mode in both error analysis and complexity estimation for both Mul and L-Mul
These two statements together are nonsensical. Sure, if you analyze accuracy while ignoring the part of the algorithm that gives the baseline its accuracy, you can derive whatever cherry-picked result you want.
The multiplication of two floating point values, if you round to nearest even, will be the correctly rounded result of multiplying the original values at infinite precision; this is how floating point rounding usually works and what IEEE 754 mandates for fundamental operations (including the multiplication here) if you choose to follow those guidelines. Not rounding to nearest even, however, results in a lot more quantization noise, and biased noise at that (a quick numerical illustration of the bias follows at the end of this comment).
> applying the L-Mul operation in tensor processing hardware can potentially reduce 95% energy cost by elementwise floating point tensor multiplications and 80% energy cost of dot products
A good chunk of the energy cost is simply moving data between memories (especially external DRAM/HBM/whatever) and along wires, buffering values in SRAMs and flip-flops and the like. Combinational logic cost is usually not a big deal. While having a ton of fixed-function matrix multipliers does raise the cost of combinational logic quite a bit, at most what they have will probably cut the power of an overall accelerator by 10-20% or so.
> In this section, we demonstrate that L-Mul can replace tensor multiplications in the attention mechanism without any loss of performance, whereas using fp8 multiplications for the same purpose degrades inference accuracy
I may have missed it in the paper, but they have provided no details on (re)scaling and/or using higher precision accumulation for intermediate results as one would experience on an H100 for instance. Without this information, I don't trust these evaluation results either.
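Picking up the rounding-bias point above, here is a small standalone sketch (my own illustration, not from the paper; the function name, the choice of keeping 3 mantissa bits, and the use of rand() are arbitrary) that quantizes float32 mantissas by plain truncation versus round-to-nearest-even and prints the mean signed error of each:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Drop the low (23 - keep) mantissa bits of a float32, either by truncation
 * (round toward zero) or by round-to-nearest-even. */
static float quantize_mantissa(float x, int keep, int rne) {
    uint32_t u;
    memcpy(&u, &x, sizeof u);
    int drop = 23 - keep;
    uint32_t low  = u & ((1u << drop) - 1);   /* bits being discarded */
    uint32_t half = 1u << (drop - 1);         /* half of one kept ULP */
    u &= ~((1u << drop) - 1);                 /* truncate             */
    if (rne && (low > half || (low == half && (u & (1u << drop)))))
        u += 1u << drop;                      /* round up; ties go to even */
    memcpy(&x, &u, sizeof x);
    return x;
}

int main(void) {
    double err_trunc = 0.0, err_rne = 0.0;
    int n = 1000000;
    srand(0);
    for (int i = 0; i < n; i++) {
        float x = 1.0f + (float)rand() / (float)RAND_MAX;   /* samples in [1, 2] */
        err_trunc += quantize_mantissa(x, 3, 0) - x;
        err_rne   += quantize_mantissa(x, 3, 1) - x;
    }
    printf("mean signed error: truncation %g, round-to-nearest-even %g\n",
           err_trunc / n, err_rne / n);
    return 0;
}

Truncation should come out with a consistently negative mean error (roughly half a kept ULP), while round-to-nearest-even should average out near zero.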
So it turns the (1+a)(1+b) into 1+a+b. Which is definitely not the same! But it turns out, machine guessing apparently doesn't care much about the difference.
Am I missing something?
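For reference, the exact expansion is (1+a)(1+b) = 1 + a + b + ab, so the approximation drops the cross term ab. With mantissa fractions a, b in [0, 1) the dropped term can approach 1, i.e. a relative error of up to roughly 25% in the worst case (a and b both near 1, where 1 + a + b of about 3 stands in for a true product of about 4), and the error is always in the same direction: too small. If I'm reading the paper correctly, that one-sided bias is why L-Mul adds a small constant offset to the mantissa sum instead of using the plain 1 + a + b.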
Feels like multiplication shouldn't be needed for convergence, just monotonicity? I wonder how well it would perform if the model was actually trained the same way.
You're going to have tolerance on the result anyway, so what's a little more error. :)
https://news.ycombinator.com/item?id=41816598
This has been done for decades in digital circuits, FPGAs, digital signal processing, etc. Floating point is both resource- and power-intensive, and absent dedicated FP processing hardware it is something that has been avoided and done without for decades unless absolutely necessary.
Their rediscovery of fixed point was bad enough but the “omg if we represent poses as quaternions everything works better” makes any game engine dev for the last 30 years explode.
At a basic level it is very simple: a 10-bit bus gives you the ability to represent numbers between 0 and 1 with a resolution of approximately 0.001; 12 bits would be four times better. Integer circuits can do the math in one clock cycle. Hardware multipliers do the same. To rescale the numbers after multiplication you just take the N high bits, where N is your bus width, which is a zero-clock-cycle operation. Etc.
In training a neural network, the back propagation math can be implemented using almost the same logic used for a polyphase FIR filter.
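As a concrete sketch of that multiply-and-rescale step (my own illustration with made-up names, using a Q1.15 format rather than the 10-bit bus above): multiply in a wider integer, then take the high bits to get back to the original width.

#include <stdint.h>

/* Q1.15 fixed point: value = raw / 2^15, range roughly [-1, 1). */
typedef int16_t q15_t;

static q15_t  q15_from_double(double v) { return (q15_t)(v * 32768.0); }
static double q15_to_double(q15_t v)    { return v / 32768.0; }

/* Multiply two Q1.15 numbers: widen to 32 bits, multiply, then keep the high
 * bits by shifting right 15, i.e. the "take the N high bits" rescale. */
static q15_t q15_mul(q15_t a, q15_t b) {
    return (q15_t)(((int32_t)a * (int32_t)b) >> 15);
}

For example, q15_mul(q15_from_double(0.5), q15_from_double(0.25)) yields the raw value 4096, which q15_to_double reads back as 0.125.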
Why not publish some actual benchmarks that prove your claim in even a few special cases?
And, two, because the actual energy cost savings claimed aren't even the experimental question: the energy cost differences between the various operations on modern hardware have been established in other research. The experimental issue here was whether the mathematical technique that enables using the lower-energy operations performs competitively on output quality with existing implementations when substituted in for LLM inference.
Should be able to go more efficient than that, as the brain has other constraints, such as working at 36.7 degrees C, etc.
"The first release of bitnet.cpp is to support inference on CPUs. bitnet.cpp achieves speedups of 1.37x to 5.07x on ARM CPUs, with larger models experiencing greater performance gains. Additionally, it reduces energy consumption by 55.4% to 70.0%, further boosting overall efficiency. On x86 CPUs, speedups range from 2.37x to 6.17x with energy reductions between 71.9% to 82.2%. Furthermore, bitnet.cpp can run a 100B BitNet b1.58 model on a single CPU, achieving speeds comparable to human reading (5-7 tokens per second), significantly enhancing the potential for running LLMs on local devices. More details will be provided soon."
Even on sites that have a "Like / Don't like" button, my understanding is that clicking "Don't like" is a form of "engagement" that the recommendation algorithms are going to reward.
Give me a button that says "this article was a scam", and have the publisher give the advertisement money back. Or better yet, give the advertisement money to charity / public services / whatever.
Take a cut of the money being transferred; charge the publishers for being able to get a "clickbait free" green mark if they implement the scheme.
Track the kinds of articles that generate the most clickbait-angry comments. Sell the data back.
There might be a business model there.
Reader: do the moral thing and read the article, not just the title
How is that balanced?
Obviously, energy cost creates a barrier to entry, so reduction of cost reduces the barrier to entry... which adds more players... which increases demand.
In other sensationalized words: "AI engineers can claim new algorithm allows them to fit GPT-5 in an RTX5090 running at 600 watts."
I had assumed the latency etc were based on what was desirable for the use case and hardware, rather than power consumption.
Maybe even generate a table of the approximate results and use that, in various stages? Like the way sin/cos was done 30 years ago, before FP coprocessors arrived.
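In that spirit, a minimal sketch of the table approach (my own illustration; the table size and names are arbitrary, and the old implementations were typically fixed-point rather than float):

#include <math.h>

/* Table-based sine: precompute N samples over one period, then index into the
 * table with linear interpolation instead of evaluating sin() each time. */
#define LUT_SIZE 256
static float sin_lut[LUT_SIZE];

static void sin_lut_init(void) {
    for (int i = 0; i < LUT_SIZE; i++)
        sin_lut[i] = (float)sin(2.0 * 3.141592653589793 * i / LUT_SIZE);
}

/* phase is in [0, 1), representing one full period. */
static float sin_approx(float phase) {
    float pos  = phase * LUT_SIZE;
    int   i    = (int)pos;
    float frac = pos - (float)i;
    float a = sin_lut[i % LUT_SIZE];
    float b = sin_lut[(i + 1) % LUT_SIZE];
    return a + frac * (b - a);   /* linear interpolation between samples */
}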
For the sake of the climate and the environment, it would be nice if it were true.
Bad news for Nvidia. “Sell your stock” bad.
Does it come with a demonstration?
If this increases integer demand and decreases floating point demand, that moderately changes future product design and doesn't do much else.
People say this but then the fastest and most-used implementation of these optimizations is always written in CUDA. If this turns out to not be a hoax, I wouldn't be surprised to see Nvidia prices jump in correlation.
Surely even the OpenAI devs must have done this like the minute they got done training that model, right? I wonder if they'd even admit it was an AI that came up with the solution rather than just publishing it, and taking credit. haha.
https://hachyderm.io/@inthehands/112006855076082650
> You might be surprised to learn that I actually think LLMs have the potential to be not only fun but genuinely useful. “Show me some bullshit that would be typical in this context” can be a genuinely helpful question to have answered, in code and in natural language — for brainstorming, for seeing common conventions in an unfamiliar context, for having something crappy to react to.
> Alas, that does not remotely resemble how people are pitching this technology.