SSMs are doing exponentially weighted moving averages (EMAs). That's it: to summarize the past, you take an EMA of a variable output at each time step. Mamba changes one key thing: instead of decaying the past by a fixed amount each step, as in a standard fixed-decay EMA, another output decides how much to forget, or equivalently, how much 'time' has passed since the last observation in our EMA.
All of the matrix equations, continuous time, discretization, etc., end up as the dynamic-forgetting EMA I describe above. This also makes the benefits and limitations clear: finite state size, and the model has to decide at a given layer what to forget before it sees the past at that layer.
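To make that concrete, here's a toy sketch (not the actual Mamba parameterization, which derives the per-step decay from Δ via discretization of continuous dynamics) contrasting a fixed-decay EMA with an input-dependent one:

```python
import numpy as np

def ema_scan(x, alpha):
    """Plain EMA: the same decay alpha at every step."""
    h, out = 0.0, []
    for xt in x:
        h = alpha * h + (1 - alpha) * xt
        out.append(h)
    return np.array(out)

def selective_ema_scan(x, alphas):
    """Mamba-style: the decay alpha_t is itself computed from the input,
    so the model chooses per step how much of the past to keep."""
    h, out = 0.0, []
    for xt, at in zip(x, alphas):
        h = at * h + (1 - at) * xt
        out.append(h)
    return np.array(out)

x = np.array([1.0, 1.0, 5.0, 1.0])
print(ema_scan(x, 0.9))
# Input-dependent decay: drop most of the state at the "surprising" token.
alphas = np.array([0.9, 0.9, 0.1, 0.9])
print(selective_ema_scan(x, alphas))
```

With the fixed decay, the spike at step 3 barely registers; with the input-dependent decay, the state jumps to track it. That one change is the "selection" in selective state spaces.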
Something I've noticed is that B, C, and Δ depend only on the current token. See this: https://www.kolaayonrinde.com/blog/images/mamba/ssm_algorith... -- Another thing I've noticed is that the definition of "SSM" in the image I've linked to is apparently recursive. This is also in the arXiv paper. Strange.
+1 though for making me go back to the article and read it more carefully! +1 also to the article.
tf-idf and similar heuristics are what we were using before attention came along, e.g. tf-idf weighted bag-of-words representations of word2vec embeddings. That approach fails in so many cases.
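For anyone who didn't live through that era, here's a minimal sketch of the tf-idf weighted bag-of-words approach (toy corpus and made-up 2-d "word2vec" vectors; real pipelines used sklearn/gensim):

```python
import math
import numpy as np

# Toy corpus and hypothetical 2-d word vectors.
corpus = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "cat", "ran"]]
vectors = {"the": np.array([0.1, 0.1]), "cat": np.array([1.0, 0.0]),
           "sat": np.array([0.0, 1.0]), "dog": np.array([-1.0, 0.0]),
           "ran": np.array([0.0, -1.0])}

def idf(word):
    # Inverse document frequency: rare words get more weight.
    df = sum(word in doc for doc in corpus)
    return math.log(len(corpus) / df)

def tfidf_embedding(doc):
    """tf-idf weighted average of word vectors: a fixed-length
    document representation with no notion of word order."""
    weights = np.array([doc.count(w) / len(doc) * idf(w) for w in doc])
    vecs = np.stack([vectors[w] for w in doc])
    return (weights[:, None] * vecs).sum(axis=0) / (weights.sum() + 1e-9)

print(tfidf_embedding(["the", "cat", "sat"]))
```

The failure mode is built in: the representation is permutation-invariant, so "dog bites man" and "man bites dog" get identical vectors, and polysemy ("bank") collapses to one point. Attention fixed exactly this.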
The same could be said of "control vectors" [1]. Both ideas are still experimental, but it seems to me, IINM, that they could replace "system prompts" and "RAG" respectively.
i.e. if I want the model to give me a summary of the news today, and the model was trained before today, can control vectors help?
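For concreteness, here's a toy sketch of what applying a control vector looks like at inference time (made-up activations and a hypothetical steering direction; the repeng-style extraction from contrastive prompt pairs is not shown):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy hidden states from one model layer: (seq_len, d_model).
hidden = rng.normal(size=(4, 8))

# Hypothetical "formal tone" direction, e.g. the mean difference between
# activations on formal vs. informal prompts, normalized to unit length.
control_vec = rng.normal(size=8)
control_vec /= np.linalg.norm(control_vec)

def apply_control(h, v, strength=2.0):
    """Add a fixed direction to every position's activation.
    strength > 0 pushes toward the concept, < 0 away from it."""
    return h + strength * v

steered = apply_control(hidden, control_vec)
```

Note what this does and doesn't do: it shifts the model toward concepts already encoded in its weights. It cannot inject new facts the model never saw (like today's news), which is why "replace RAG" seems like a stretch to me, while "replace tone-setting system prompts" is plausible.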
The concepts behind control vectors, i.e. "representation engineering", are not especially new and have been highly effective in the diffusion space. I always find it entertaining when LLM folks act like they're discovering stuff that waifu stable diffusion folks have known about for 6+ months, like "concept slider LoRAs".
I'm familiar with our intrepid stable diffusion sailors.
I don't know why you think the post is being downvoted.
I don't know why it would be verboten to downvote it, or indicative of the downvoter being an LLM fanatic who thinks they discovered everything.
I am puzzled by the post because it claims RAG can be replaced by control vectors.
I'm also puzzled because it claims prompts can be replaced by control vectors.
I get that if system prompts were only used to shift output tone, control vectors could replace that case, but that seems narrow compared to the full set of things prompt input enables (inter alia, in-context learning).
In that context, all new research directions are valuable simply for the fact that they're expanding the foundation of the field. 5 years from now, who knows what the most effective models will use under the hood, but the more we can learn about them in general, the better.
In 2018, the release of transformers (via Google) enabled much more rapid training of models and more generalization with less data. 100% of the LLMs (as you'd probably think of them) trace their origins to BERT.
That said, my team was working with hundred-million to low-billion-parameter LSTMs and CNNs back in 2016-2017 that were comparable to some lighter-weight LLMs today.
In my opinion, the greatest strides in the space have had less to do with the underlying architecture and more to do with improved data formatting, accessibility, and compute improvements.
Most of the research that enabled ChatGPT was also already known. "Attention is all you need" was a 2017 paper.
It still is a fast evolving field, but not one that just kicked off.
Let’s rephrase it. If their architecture is superior, and they have $30 million, similar preparation for training, and similar operational teams during training, then we can see whether they can beat the model they’re comparing themselves to. Except the alternatives don’t have tens of millions of dollars and the best support teams. So the proof you seek hasn’t had a chance to happen, due to a severe lack of resources.
Hence the comparisons to GPT-2 and small versions of GPT-3. Even that might not be fair, given the money and teams behind even the small GPT-3s. Execution of the project is as critical to success as the model architecture.
BALK RULES! IMPORTANT! 1. You can’t just be up there and just doin’ a balk like that.
1a. A balk is when you
1b. Okay well listen. A balk is when you balk the
1c. Let me start over
1c-a. The pitcher is not allowed to do a motion to the, uh, batter, that prohibits the batter from doing, you know, just trying to hit the ball. You can’t do that.
1c-b. Once the pitcher is in the stretch, he can’t be over here and say to the runner, like, “I’m gonna get ya! I’m gonna tag you out! You better watch your butt!” and then just be like he didn’t even do that.
1c-b(1). Like, if you’re about to pitch and then don’t pitch, you have to still pitch. You cannot not pitch. Does that make any sense?
1c-b(2). You gotta be, throwing motion of the ball, and then, until you just throw it.
1c-b(2)-a. Okay, well, you can have the ball up here, like this, but then there’s the balk you gotta think about.
1c-b(2)-b. Fairuza Balk hasn’t been in any movies in forever. I hope she wasn’t typecast as that racist lady in American History X.
1c-b(2)-b(i). Oh wait, she was in The Waterboy too! That would be even worse.
1c-b(2)-b(ii). “get in mah bellah” – Adam Water, “The Waterboy.” Haha, classic…
1c-b(3). Okay seriously though. A balk is when the pitcher makes a movement that, as determined by, when you do a move involving the baseball and field of
2. Do not do a balk please.
[0]: https://justinbee.tumblr.com/post/15309101943/best-explanati...
Basically, Nvidia et al. don't want AI research to move in a direction that requires less GPU compute, less training data, and less inference compute.
Someone on HN (I don't remember the name) mentioned that the idea of deep learning is backed by big tech because it benefits them the most, as they are the only players in town with huge amounts of data. If the AI community found entirely different approaches to AGI (maybe not even learning-based), who do you think would suffer the most from the implications?
If the funding were funneled to research groups working on alternative approaches, maybe we'd see the same amount of progress in AI using another approach entirely.
The quadratic bottleneck is due to the lower bounds of exhaustive search.
The papers on this only ever seem to reference perplexity.
The fact it can append a word to "I'm going to the beach" that sounds good doesn't mean it is useful.
There is no free lunch, and this project hasn't shown that the costs are acceptable.
"I'm going to the beach" + house
Doesn't help if what you needed was
"I'm going to the beach" + tomorrow
I do hope that there is more information on the costs, or that they have disproven SETH soon.
Or maybe it does not pan out at all. We are still at the stage where people are throwing everything at the wall to see what sticks. Some promising ideas which work at small scale do not work at bigger.
The thing is that Mamba is not perfect. There's no neural architecture to rule them all, if you will. I think the bigger issue is that we act as if there is and get on bandwagons. Let me give a clearer example from the past. The predecessor to DDPM (the work that kicked off the diffusion model era) was published in 2015 [0], only a year after GANs [1]. Diffusion showed promise then, so why did DDPM only come out in 2020 [2]? Because everyone was working on GANs. GANs produced far better images, and diffusion was (and still is) a lot more computationally intensive. Plus, the people working on diffusion models were in the same camp as those working on normalizing flows and other density-based models, and fewer people are interested in density estimation.
So the entire problem is that the community hopped on a single railroad of research direction. There was still work going on in the other direction, but it wasn't getting nearly the attention that GANs got. It's hard to know whether things were blocked from publication simply because they weren't as good as GANs. I can say from personal experience that I had a flow publication blocked because reviewers were concerned with its quality compared to GANs (this was 2019/2020; that paper will never be published, because now it is hard to publish even a GAN work).
So, yes and no: there is certainly railroading happening, but there are also real critiques of Mamba. What people often forget is that it is incredibly naive to compare new methods to existing methods one-to-one. You're comparing something that has hundreds to thousands of hours from a handful to a few dozen eyes against works with millions of hours and millions of eyes. Evaluation is just a really fucking hard thing to do, but it is easy to just look at some benchmarks, even if they don't mean much. This is a fairly generalizable lesson, so take it to heart. That said, Mamba seems a bit different from our diffusion/GAN story, in that it is getting more attention than diffusion did in 2016-2019.
[0] https://arxiv.org/abs/1503.03585
[1] https://arxiv.org/abs/1406.2661
[2] https://arxiv.org/abs/2006.11239
There is no conspiracy against efficient training. Companies aren’t going to lower compute budgets just because efficiency improves.
All the top labs are increasing efficiency, but they use it to get more out of their large runs, not to spend less. Most companies have a relatively fixed training budget for their large runs and are trying to get the most out of it, not save money.
Mamba is actually being scaled up and tested across other fields (e.g., biology) at a rapid pace compared to other architectures.
Fwiw, the OP isn't suggesting conspiracy. The notion is more about convergent thinking.
I think that's also how RWKV works.
Edit: Sounds like it is. https://openreview.net/pdf?id=AL1fq05o7H
In our case, we don't actually wait for a closed-form solution but instead compute the discrete representation (Equation 2).
Hope that helps!
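For reference, the zero-order-hold discretization behind that step (Equation 2 in the S4/Mamba line of work) can be sketched for a diagonal A; variable names here are my own:

```python
import numpy as np

def discretize(A, B, delta):
    """Zero-order-hold discretization of the diagonal continuous SSM
    h'(t) = A h(t) + B x(t), per the S4/Mamba convention:
        A_bar = exp(delta * A)
        B_bar = (delta A)^{-1} (exp(delta A) - I) * delta B
    For diagonal (elementwise) A, B_bar simplifies as below."""
    A_bar = np.exp(delta * A)
    B_bar = (A_bar - 1.0) / A * B  # elementwise; requires A != 0
    return A_bar, B_bar

def ssm_step(h, x, A_bar, B_bar):
    """One recurrent step of the discretized system."""
    return A_bar * h + B_bar * x

A = np.array([-1.0, -0.5])  # diagonal of A (negative for stability)
B = np.array([1.0, 1.0])
A_bar, B_bar = discretize(A, B, delta=0.1)
h = ssm_step(np.zeros(2), 1.0, A_bar, B_bar)
```

This is also where the EMA intuition upthread comes from: `A_bar` plays the role of the decay, and a larger Δ means more "time" has passed, i.e. more forgetting.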