Yes, the benefits are realized in custom hardware designs as opposed to software. However, the hardware architectures do work for multiplying matrices of arbitrary dimensions: larger matrices are split into smaller tiles, and the tile products are summed to form the final larger matrix product (i.e. GEMM).
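For anyone who wants to see the tiling idea concretely, here's a minimal Python/NumPy sketch (my own illustration, assuming dimensions are multiples of the tile size for simplicity):

    import numpy as np

    def tiled_matmul(A, B, T=2):
        # Split A (m x n) and B (n x p) into T x T tiles, multiply tile
        # pairs with the small fixed-size kernel, and accumulate the
        # partial products into the corresponding output block.
        m, n = A.shape
        _, p = B.shape
        C = np.zeros((m, p))
        for i in range(0, m, T):
            for k in range(0, p, T):
                for j in range(0, n, T):
                    # in hardware, this T x T product is the fast kernel
                    C[i:i+T, k:k+T] += A[i:i+T, j:j+T] @ B[j:j+T, k:k+T]
        return C

    A = np.random.randn(4, 6)
    B = np.random.randn(6, 8)
    assert np.allclose(tiled_matmul(A, B), A @ B)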
I wish I could remember his name, I believe he left academia after my year and went to work in industry, I'd just be curious to see what he's up to now. I'm not saying it was a particularly novel or prescient comment/attitude, we may not have had quite such ML hype but certainly 'big data' was all the rage at the time, it's just something that's stuck in my mind. One of those areas I always meant to study more, just realistically probably never had the mathematical chops for and certainly those I did have atrophied.
W. Miller has a paper showing that, under conditions of numerical stability, O(n^3) multiplications are necessary [0]. Any algorithm that achieves sub-cubic runtime for matrix multiplication, like Strassen's or Coppersmith-Winograd, must sacrifice some amount of precision or stability.
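For reference, here's Strassen's 2x2 base case as a NumPy sketch (my own illustration): 7 multiplications instead of 8, paid for with extra additions, which is where the weaker stability guarantees come from.

    import numpy as np

    def strassen_2x2(A, B):
        # Strassen's scheme: 7 multiplications (m1..m7) instead of 8.
        (a, b), (c, d) = A
        (e, f), (g, h) = B
        m1 = (a + d) * (e + h)
        m2 = (c + d) * e
        m3 = a * (f - h)
        m4 = d * (g - e)
        m5 = (a + b) * h
        m6 = (c - a) * (e + f)
        m7 = (b - d) * (g + h)
        return np.array([[m1 + m4 - m5 + m7, m3 + m5],
                         [m2 + m4,           m1 - m2 + m3 + m6]])

    A = np.random.randn(2, 2)
    B = np.random.randn(2, 2)
    assert np.allclose(strassen_2x2(A, B), A @ B)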
https://github.com/ixaxaar/pytorch-dni
The concept here goes a bit further and tries to replicate backprop with an external network, arguing that that's probably what the brain actually does.
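Roughly, the mechanism looks like the following PyTorch sketch (my own illustration of the synthetic-gradient idea, not the repo's actual API):

    import torch
    import torch.nn as nn

    # A small side network predicts the gradient of the loss w.r.t. a
    # layer's activations, so the layer can update without waiting for
    # the true backward pass.
    layer = nn.Linear(32, 64)
    grad_model = nn.Linear(64, 64)  # predicts dL/dh from the activations h
    opt = torch.optim.SGD(layer.parameters(), lr=1e-2)

    x = torch.randn(8, 32)
    h = layer(x)
    g_hat = grad_model(h.detach())  # synthetic gradient; no true backprop yet
    h.backward(g_hat.detach())      # drive the layer's update with g_hat
    opt.step()
    opt.zero_grad()
    # When the true gradient g eventually arrives, grad_model itself is
    # trained to regress it, e.g. with ((g_hat - g) ** 2).mean().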
Wellll that means I can give dni another try :D
However, if this really is the biological analogue of credit assignment, it might scale better than training LLMs from scratch every time. Even if it could only approximate the gradients to a certain degree for a new network, normal backprop could then fine-tune for a few epochs or so, dramatically reducing overall training costs.
Another cool way to eliminate multiplication in matrix multiplication is to use different semirings [1]. The Tropical Semiring [2], for example, substitutes addition for multiplication and min (or max) for addition. It's still matrix multiplication, just with substituted binary operations. Research in the relatively new field of Tropical Algebra [3] is quite active and rich right now; it's being used for all kinds of optimization problems and in research on optimizing neural networks [4]. This approach also lends itself to hardware synthesis, since most FPGA configurable logic blocks can add/min/max in one clock cycle, whereas efficient multiplication requires fixed, dedicated on-chip hardware multipliers.
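As a concrete sketch (my own illustration) of matrix multiplication over the (min, +) tropical semiring:

    import numpy as np

    def tropical_matmul(A, B):
        # Ordinary multiply becomes +, ordinary add becomes min, and
        # +inf plays the role of the additive identity (zero).
        m, n = A.shape
        _, p = B.shape
        C = np.full((m, p), np.inf)
        for i in range(m):
            for k in range(p):
                # C[i, k] = min_j (A[i, j] + B[j, k]) -- no multiplications
                C[i, k] = np.min(A[i, :] + B[:, k])
        return C

Iterated on a graph's edge-weight matrix, this computes all-pairs shortest paths, which is one of the classic applications.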
Another way to efficiently remove multiplications, with a different but related semiring, is to use a Log Semiring [5]. If you have to multiply chains of probabilities (as in Markov chains), the numbers quickly become so small that floating point can no longer represent them accurately. By first mapping the numbers into log space, multiplication becomes addition and addition becomes x + log1p(exp(y - x)) (sketched below).
[1] https://en.wikipedia.org/wiki/Semiring
[2] https://en.wikipedia.org/wiki/Tropical_semiring
[3] https://en.wikipedia.org/wiki/Tropical_geometry
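To make the log-semiring trick above concrete, a small sketch (my own illustration) of how log space sidesteps the underflow:

    import numpy as np

    def log_add(x, y):
        # log(e^x + e^y) without leaving log space; pulling out the max
        # keeps exp() from overflowing or underflowing.
        hi, lo = max(x, y), min(x, y)
        return hi + np.log1p(np.exp(lo - hi))

    p = q = 1e-300
    print(p * q)                          # 0.0 -- underflows in float64
    print(np.log(p) + np.log(q))          # about -1381.6: "multiplication"
    print(log_add(np.log(p), np.log(q)))  # log(p + q), still finite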
I only understood a fraction of it, but after reading such a knowledgeable comment I instantly wanted to dive into the topic.
Addition/subtraction in a logarithmic number system is way more expensive than what you would spend on multiplication, especially if you care about correctly rounded results, as the (hardware) LUTs required are rather big.
Isn't this the same approach as in GF(2^x), which has been in use for decades? The only limitation that comes to mind is the field size.
I conjecture that for every j > 0 in R, a number n exists so that any two n x n matrices can be multiplied together in O(n^(2+j)) steps.
(Now proven for 2+j = w = 2.3728596, i.e. j >= 0.3728596)
Is this stated correctly? Because it seems almost meaningless as stated. You start with "for every j, there exists an n such that...". That would mean that for the rest of the statement, n and j are constant. So you are just saying that you can multiply constant sized matrices in constant time. Technically true, but I feel like you are trying to claim something stronger.
for any j>0 there exists an algorithm multiplying nxn matrices in time O(n^{2+j}).
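Spelled out with explicit quantifiers (my own formalization, just to contrast with the quantifier order in the original statement):

    \forall \varepsilon > 0 \;\; \exists \, \mathcal{A}_\varepsilon, C_\varepsilon \;\; \forall n : \quad
    \mathcal{A}_\varepsilon \text{ multiplies two } n \times n \text{ matrices in at most } C_\varepsilon \, n^{2+\varepsilon} \text{ steps.}

The algorithm and the constant may depend on epsilon but not on n; quantifying n before the algorithm is what reduces the claim to a triviality.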
It's clear that the algorithm needs at least O(n^2), because just accessing each element of the matrices once takes a double for loop, which is O(n^2):
    for i in range(n):      # every row
        for j in range(n):  # every column
            ...  # do something with matrix1[i, j], matrix2[i, j], ...
so it has to be j >= 0.

And the diagrams are chaotic and don't really explain anything about why this approach is fast or good. The result is that I'm reluctant to even click through to the PDF.
If you want to improve the project's credibility, please consider being honest and open about what is actually going on and giving some clear explanations and illustrations, rather than material that may as well be designed to hype people too busy to tell you that you are cranks. It's hard to tell if this is incredibly groundbreaking or just a nothingburger. Sadly, I almost feel like that must be an intentional decision motivated by the poor merits of the work and a desire to exploit AI hype. The alternative - which I prefer to believe is the case - is that the author simply needs to revise and better contextualize.
The claim is they’re dropping half the multiplications, so it isn’t doing anything for Big O.
> If you want to improve the project's credibility, please consider being honest and open about what is actually going on and giving some clear explanations and illustrations,
The math explaining how to halve the number of multiplications in the paper (https://arxiv.org/abs/2311.12224) isn’t hard to understand.
You only have to read formulas 2 (traditional matrix multiplication) and 3 to 6.
I think it’s clear it does do what’s being advertised, halving the number of multiplications at the cost of lots of extra additions/subtractions.
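For anyone not opening the PDF, the trick in those formulas can be sketched in a few lines of NumPy (my own reconstruction of Winograd's 1967 inner-product formula, not the paper's code). The x-only and y-only product terms are hoisted out once per row/column, so each output element costs about n/2 multiplications:

    import numpy as np

    def winograd_matmul(A, B):
        # A is m x n, B is n x p, n even. Correction terms xi (one per row
        # of A) and eta (one per column of B) are computed once and reused.
        m, n = A.shape
        _, p = B.shape
        xi = (A[:, 0::2] * A[:, 1::2]).sum(axis=1)   # shape (m,)
        eta = (B[0::2, :] * B[1::2, :]).sum(axis=0)  # shape (p,)
        C = np.empty((m, p))
        for i in range(m):
            for k in range(p):
                # n/2 multiplications per output element, plus extra adds
                C[i, k] = ((A[i, 0::2] + B[1::2, k]) *
                           (A[i, 1::2] + B[0::2, k])).sum() - xi[i] - eta[k]
        return C

    A = np.random.randn(4, 6)
    B = np.random.randn(6, 5)
    assert np.allclose(winograd_matmul(A, B), A @ B)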
They then go on to vectorize that algorithm, which, as is usual for such things, quickly gets messy-looking.
My main concern would be numerical stability.
As for whether it's groundbreaking or not ... it's a neat readily-accessible constant-factor improvement for area-constrained fixed-point accelerators. That doesn't change everything overnight, but neither is it nothing. It's nice work.
Without wishing to sound elitist, I don't understand the point of this comment at all. If you don't understand Big O notation enough to know that "half the multiplications" doesn't change it, then why are you even asking about it?
First misunderstanding: I assumed this was a new large matrix multiplication algorithm building on the hype from last week or so where we saw this paper: https://arxiv.org/abs/2210.10173
It is not an algorithm, but a hardware design - a systolic array using roughly half of the silicon area of a baseline design.
Second misunderstanding: assuming that we were talking about an algorithm, I further assumed that it reduced an algorithm's multiplications by half for some important n. I assumed that it did this by accelerating some critical subprocedure of the baseline algorithm into a more efficient big-O class without really changing the multiplicative factor - a common way to reduce the number of operations of an algorithm for some fixed n. As a consequence, I thought the author must be being sloppy by not telling us the full big-O details of the improvement, and just picking some n where it so happened that half of the multiplications vanish. That also seemed unlikely to be a consequence of an improvement in the matrix multiplication bound, given how incredibly slow progress on that bound has been. So I thought that the author might even be a crank.
But it turned out the sloppy thinking was on me. I was being the crank.
Reading the paper's introduction made it very very clear that we were dealing with a systolic array that reduces silicon area per compute.
Even worse, that information is right there in the first sentence of the README as well.
Would a clearer sentence have helped me? Something like:
"We introduce a VHDL hardware design for a systolic array that nearly halves the silicon area of a baseline array, by replacing half of the multiplication-accumulate units (MACs) with simple adder units, exploiting Winograd's 1967 fast inner product formula (FIP)."
I'm honestly not sure, given how bad my mistake was to begin with. Not even the diagrams tipped me off - in hindsight they are very obviously hardware block diagrams, but I thought that they were just needlessly complicated algorithmic diagrams! How silly!
I still believe that the readme could be simplified for a general audience of goobers like me. But first and foremost I have to admit that I was being a goober!
Does that help you understand my mistake here? I do understand Big O and why cutting operations in half is typically a constant-factor improvement. But apparently I don't understand it well enough to prevent me from retconning a narrative with some very stinky assumptions and then projecting them onto the poor innocent hardware designer. Not very proud of that.