Yes, the benefits are realized in custom hardware designs as opposed to software. However, the hardware architectures do work for multiplying matrices of arbitrary dimensions: larger matrices are split into smaller tiles, and the tile products are summed to form the final larger matrix product (i.e. GEMM).
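For anyone who wants to see the tiling idea concretely, here's a minimal Python/NumPy sketch (my own illustration, assuming dimensions are multiples of the tile size for simplicity):

    import numpy as np

    def tiled_matmul(A, B, T=2):
        # Split A (m x n) and B (n x p) into T x T tiles, multiply tile
        # pairs with the small fixed-size kernel, and accumulate the
        # partial products into the corresponding output block.
        m, n = A.shape
        _, p = B.shape
        C = np.zeros((m, p))
        for i in range(0, m, T):
            for k in range(0, p, T):
                for j in range(0, n, T):
                    # in hardware, this T x T product is the fast kernel
                    C[i:i+T, k:k+T] += A[i:i+T, j:j+T] @ B[j:j+T, k:k+T]
        return C

    A = np.random.randn(4, 6)
    B = np.random.randn(6, 8)
    assert np.allclose(tiled_matmul(A, B), A @ B)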
I wish I could remember his name, I believe he left academia after my year and went to work in industry, I'd just be curious to see what he's up to now. I'm not saying it was a particularly novel or prescient comment/attitude, we may not have had quite such ML hype but certainly 'big data' was all the rage at the time, it's just something that's stuck in my mind. One of those areas I always meant to study more, just realistically probably never had the mathematical chops for and certainly those I did have atrophied.
W. Miller has a paper showing that, under conditions of numerical stability, O(n^3) multiplications are necessary [0]. Any algorithm that achieves sub-cubic runtime for matrix multiplication, like Strassen's or Coppersmith-Winograd, must sacrifice some amount of precision or stability.
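For reference, here's Strassen's 2x2 base case as a NumPy sketch (my own illustration): 7 multiplications instead of 8, paid for with extra additions, which is where the weaker stability guarantees come from.

    import numpy as np

    def strassen_2x2(A, B):
        # Strassen's scheme: 7 multiplications (m1..m7) instead of 8.
        (a, b), (c, d) = A
        (e, f), (g, h) = B
        m1 = (a + d) * (e + h)
        m2 = (c + d) * e
        m3 = a * (f - h)
        m4 = d * (g - e)
        m5 = (a + b) * h
        m6 = (c - a) * (e + f)
        m7 = (b - d) * (g + h)
        return np.array([[m1 + m4 - m5 + m7, m3 + m5],
                         [m2 + m4,           m1 - m2 + m3 + m6]])

    A = np.random.randn(2, 2)
    B = np.random.randn(2, 2)
    assert np.allclose(strassen_2x2(A, B), A @ B)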
https://github.com/ixaxaar/pytorch-dni
The concept here goes a bit further and tries to replicate backprop with an external network, arguing that that's probably what the brain actually does.
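Roughly, the mechanism looks like the following PyTorch sketch (my own illustration of the synthetic-gradient idea, not the repo's actual API):

    import torch
    import torch.nn as nn

    # A small side network predicts the gradient of the loss w.r.t. a
    # layer's activations, so the layer can update without waiting for
    # the true backward pass.
    layer = nn.Linear(32, 64)
    grad_model = nn.Linear(64, 64)  # predicts dL/dh from the activations h
    opt = torch.optim.SGD(layer.parameters(), lr=1e-2)

    x = torch.randn(8, 32)
    h = layer(x)
    g_hat = grad_model(h.detach())  # synthetic gradient; no true backprop yet
    h.backward(g_hat.detach())      # drive the layer's update with g_hat
    opt.step()
    opt.zero_grad()
    # When the true gradient g eventually arrives, grad_model itself is
    # trained to regress it, e.g. with ((g_hat - g) ** 2).mean().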
Wellll that means I can give dni another try :D
However, if this really is the biological analogue of credit assignment, it might scale better than training LLMs from scratch every time. Even if it could only approximate the gradients to a certain degree for a new network, normal backprop could then fine-tune for a few epochs or so, dramatically reducing overall training costs.
Another cool way to eliminate multiplication in matrix multiplication is to use different semirings [1]. The Tropical Semiring [2], for example, substitutes addition for multiplication and min (or max) for addition. It's still matrix multiplication, just with substituted binary operations. Research in the relatively new field of Tropical Algebra [3] is quite active and rich right now; it's being used for all kinds of optimization problems and in research on optimizing neural networks [4]. This approach also lends itself to hardware synthesis, since most FPGA configurable logic blocks can add/min/max in one clock cycle, whereas efficient multiplication requires fixed, dedicated on-chip hardware multipliers.
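As a concrete sketch (my own illustration) of matrix multiplication over the (min, +) tropical semiring:

    import numpy as np

    def tropical_matmul(A, B):
        # Ordinary multiply becomes +, ordinary add becomes min, and
        # +inf plays the role of the additive identity (zero).
        m, n = A.shape
        _, p = B.shape
        C = np.full((m, p), np.inf)
        for i in range(m):
            for k in range(p):
                # C[i, k] = min_j (A[i, j] + B[j, k]) -- no multiplications
                C[i, k] = np.min(A[i, :] + B[:, k])
        return C

Iterated on a graph's edge-weight matrix, this computes all-pairs shortest paths, which is one of the classic applications.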
Another way to efficiently remove multiplications, with a different but related semiring, is to use a Log Semiring [5]. If you have to multiply chains of probabilities (as in Markov chains), the numbers quickly become so small that floating point can no longer represent them accurately. By first mapping the numbers into log space, multiplication becomes addition and addition becomes x + log1p(exp(y - x)) (sketched below).
[1] https://en.wikipedia.org/wiki/Semiring
[2] https://en.wikipedia.org/wiki/Tropical_semiring
[3] https://en.wikipedia.org/wiki/Tropical_geometry
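To make the log-semiring trick above concrete, a small sketch (my own illustration) of how log space sidesteps the underflow:

    import numpy as np

    def log_add(x, y):
        # log(e^x + e^y) without leaving log space; pulling out the max
        # keeps exp() from overflowing or underflowing.
        hi, lo = max(x, y), min(x, y)
        return hi + np.log1p(np.exp(lo - hi))

    p = q = 1e-300
    print(p * q)                          # 0.0 -- underflows in float64
    print(np.log(p) + np.log(q))          # about -1381.6: "multiplication"
    print(log_add(np.log(p), np.log(q)))  # log(p + q), still finite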
I only understood a fraction of it, but after reading such a knowledgeable comment I instantly wanted to dive into the topic.
Addition/subtraction in a logarithmic number system is way more expensive than what you would spend on multiplication, especially if you care about correctly rounded results, as the (hardware) LUTs required are rather big.
Isn't this the same approach as in GF(2^x), which has been in use for decades? The only limitation that comes to mind is the field size.
I conjecture that for every j > 0 in R, a number n exists so that any two n x n matrices can be multiplied together in O(n^(2+j)) steps.
(Now proven for 2+j = w = 2.3728596, i.e. j >= 0.3728596)
Is this stated correctly? Because it seems almost meaningless as stated. You start with "for every j, there exists an n such that...". That would mean that for the rest of the statement, n and j are constant. So you are just saying that you can multiply constant sized matrices in constant time. Technically true, but I feel like you are trying to claim something stronger.
for any j>0 there exists an algorithm multiplying nxn matrices in time O(n^{2+j}).
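Spelled out with explicit quantifiers (my own formalization, just to contrast with the quantifier order in the original statement):

    \forall \varepsilon > 0 \;\; \exists \, \mathcal{A}_\varepsilon, C_\varepsilon \;\; \forall n : \quad
    \mathcal{A}_\varepsilon \text{ multiplies two } n \times n \text{ matrices in at most } C_\varepsilon \, n^{2+\varepsilon} \text{ steps.}

The algorithm and the constant may depend on epsilon but not on n; quantifying n before the algorithm is what reduces the claim to a triviality.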
It's clear that the algorithm needs at least O(n^2), because just accessing each element of the matrices once takes a double for loop, which is O(n^2):
    for i in range(n):      # every row
        for j in range(n):  # every column
            ...  # do something with matrix1[i, j], matrix2[i, j], ...
so it has to be j >= 0.

And the diagrams are chaotic and don't really explain anything about why this approach is fast or good. The result is that I'm reluctant to even click through to the PDF.
If you want to improve the project's credibility, please consider being honest and open about what is actually going on and giving some clear explanations and illustrations, rather than material that may as well be designed to hype people too busy to tell you that you are cranks. It's hard to tell if this is incredibly groundbreaking or just a nothingburger. Sadly, I almost feel like that must be an intentional decision motivated by the poor merits of the work and a desire to exploit AI hype. The alternative - which I prefer to believe is the case - is that the author simply needs to revise and better contextualize.
The claim is they’re dropping half the multiplications, so it isn’t doing anything for Big O.
> If you want to improve the project's credibility, please consider being honest and open about what is actually going on and giving some clear explanations and illustrations,
The math explaining how to halve the number of multiplications in the paper (https://arxiv.org/abs/2311.12224) isn’t hard to understand.
You only have to read formulas 2 (traditional matrix multiplication) and 3 to 6.
I think it’s clear it does do what’s being advertised, halving the number of multiplications at the cost of lots of extra additions/subtractions.
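For anyone not opening the PDF, the trick in those formulas can be sketched in a few lines of NumPy (my own reconstruction of Winograd's 1967 inner-product formula, not the paper's code). The x-only and y-only product terms are hoisted out once per row/column, so each output element costs about n/2 multiplications:

    import numpy as np

    def winograd_matmul(A, B):
        # A is m x n, B is n x p, n even. Correction terms xi (one per row
        # of A) and eta (one per column of B) are computed once and reused.
        m, n = A.shape
        _, p = B.shape
        xi = (A[:, 0::2] * A[:, 1::2]).sum(axis=1)   # shape (m,)
        eta = (B[0::2, :] * B[1::2, :]).sum(axis=0)  # shape (p,)
        C = np.empty((m, p))
        for i in range(m):
            for k in range(p):
                # n/2 multiplications per output element, plus extra adds
                C[i, k] = ((A[i, 0::2] + B[1::2, k]) *
                           (A[i, 1::2] + B[0::2, k])).sum() - xi[i] - eta[k]
        return C

    A = np.random.randn(4, 6)
    B = np.random.randn(6, 5)
    assert np.allclose(winograd_matmul(A, B), A @ B)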
They then go on to vectorize that algorithm, which, as is usual for such things, quickly gets messy-looking.
My main concern would be numerical stability.
As for whether it's groundbreaking or not ... it's a neat readily-accessible constant-factor improvement for area-constrained fixed-point accelerators. That doesn't change everything overnight, but neither is it nothing. It's nice work.
Without wishing to sound elitist, I don't understand the point of this comment at all. If you don't understand Big O notation enough to know that "half the multiplications" doesn't change it, then why are you even asking about it?
First misunderstanding: I assumed this was a new large matrix multiplication algorithm building on the hype from last week or so where we saw this paper: https://arxiv.org/abs/2210.10173
It is not an algorithm, but a hardware design - a systolic array using roughly half of the silicon area of a baseline design.
Second misunderstanding: assuming that we were talking about an algorithm, I further assumed that it reduced an algorithm's multiplications by half for some important n. I assumed that it did this by accelerating some critical subprocedure of the baseline algorithm into a more efficient big-O class without really changing the multiplicative factor - a common way to reduce the number of operations of an algorithm for some fixed n. As a consequence, I thought the author must be being sloppy by not telling us the full big-O details of the improvement, and just picking some n where it so happened that half of the multiplications vanish. That also seemed unlikely to be a consequence of an improvement in the matrix multiplication bound, given how incredibly slow progress on that bound has been. So I thought that the author might even be a crank.
But it turned out the sloppy thinking was on me. I was being the crank.
Reading the paper's introduction made it very very clear that we were dealing with a systolic array that reduces silicon area per compute.
Even worse, that information is right there in the first sentence of the README as well.
Would a clearer sentence have helped me? Something like:
"We introduce a VHDL hardware design for a systolic array that nearly halves the silicon area of a baseline array, by replacing half of the multiplication-accumulate units (MACs) with simple adder units, exploiting Winograd's 1967 fast inner product formula (FIP)."
I'm honestly not sure, given how bad my mistake was to begin with. Not even the diagrams tipped me off - in hindsight they are very obviously hardware block diagrams, but I thought that they were just needlessly complicated algorithmic diagrams! How silly!
I still believe that the readme could be simplified for a general audience of goobers like me. But first and foremost I have to admit that I was being a goober!
Does that help you understand my mistake here? I do understand Big O and why cutting operations in half is typically a constant-factor improvement. But apparently I don't understand it well enough to prevent me from retconning a narrative with some very stinky assumptions and then projecting them onto the poor innocent hardware designer. Not very proud of that.