The fact is large language models require a lot of VRAM, and the more interesting ones need more than 24GB to run.
People who can afford systems with more than 24GB of VRAM will go buy hardware that provides it, so when GPU vendors release products with insufficient VRAM, they limit their own market.
I mean inequality is definitely increasing at a worrying rate these days, but let's keep the discussion on topic...
I learned my RAM lesson when I bought my first real Linux PC. It had 4MB of RAM, which was enough to run X, bash, xterm, and Emacs. But once I ran all that and also wanted to compile with g++, it would start swapping, which in the days of slow hard drives was death to productivity.
I spent $200 to double to 8MB, then another $200 to double to 16MB, and finally $200 to max out the RAM on my machine: 32MB! Once I did that, everything flew.
Rather than attempting to solve the problem by making Emacs ("eight megs and constantly swapping") use less RAM, or finding a way to hack without X, I deployed money to max out my machine. That was practical, but only realistically available to me because I gave up other things in life for the short term. Not only was I more productive, I used the time saved to work on other engineering problems that helped build my career, while also learning an important lesson about swapping/paging.
People demand RAM, and what isn't practically available today is often standard two years later. Seems like a great approach to me, especially if you don't have enough smart engineers to work around problems like that (see "How would you sort 4M integers in 2M of RAM?").
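For reference, the classic answer to that interview question is an external merge sort: sort runs that fit in memory, spill them to disk, then stream-merge the runs. A minimal Python sketch of the idea, with a list slice standing in for a real bounded-memory input reader:

```python
import heapq
import os
import random
import tempfile

def external_sort(values, chunk_size):
    """Sort more data than 'fits in memory' by sorting chunk_size-sized
    runs, spilling each run to disk, then k-way merging the runs."""
    run_files = []
    # Phase 1: sort each chunk in memory and write it out as a sorted run.
    for start in range(0, len(values), chunk_size):
        chunk = sorted(values[start:start + chunk_size])
        f = tempfile.NamedTemporaryFile(mode="w+", delete=False)
        f.writelines(f"{v}\n" for v in chunk)
        f.seek(0)
        run_files.append(f)
    # Phase 2: heapq.merge streams the runs, holding only one value
    # per run in memory at a time.
    runs = ((int(line) for line in f) for f in run_files)
    merged = list(heapq.merge(*runs))
    for f in run_files:
        f.close()
        os.unlink(f.name)
    return merged

data = [random.randrange(1_000_000) for _ in range(10_000)]
assert external_sort(data, chunk_size=1_000) == sorted(data)
```

For the original puzzle you'd read the integers from disk in chunks instead of slicing a list, but the run-then-merge structure is the same.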
Thank you. Now I feel a lot better about dropping $700 on the 32MB of RAM when I built my first rig.
It is possible that compressing and using all of human knowledge simply takes a lot of memory, and in some cases accuracy is more important than reducing memory usage.
For example, [1] shows how Gemma 2B using AVX512 instructions could solve problems it couldn't solve using AVX2, because of rounding issues with the lower-precision instructions. It's likely that most quantization (and other memory-reduction) schemes have similar problems.
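To illustrate the rounding point (this is a toy sketch, not the linked experiment), here is symmetric uniform quantization showing how reconstruction error grows as precision drops; the weight values and bit-widths are made up for illustration:

```python
def quantize(weights, bits=8):
    """Symmetric uniform quantization: map floats to signed integers
    in [-(2^(bits-1)-1), 2^(bits-1)-1] using a single scale factor."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax
    return [round(w / scale) for w in weights], scale

def dequantize(q, scale):
    """Map quantized integers back to approximate float values."""
    return [v * scale for v in q]

weights = [0.001, -0.002, 0.5, -0.9, 0.0003]
for bits in (8, 4):
    q, scale = quantize(weights, bits)
    restored = dequantize(q, scale)
    max_err = max(abs(w - r) for w, r in zip(weights, restored))
    print(f"{bits}-bit max reconstruction error: {max_err:.4f}")
```

Note how the small weights (0.001, 0.0003) round to zero at both precisions: once a value falls below half the quantization step, the information is simply gone, which is one way low-precision inference changes model behavior.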
As we develop more multi-modal models that can do things like understand 3D video at better than real time, it's likely memory requirements will increase, not decrease.
It is just that there's a limit to how much you can compress the models.
The people training 70B parameter models from scratch need ~600GB of VRAM to do it!
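That ballpark can be sanity-checked with a rough bytes-per-parameter accounting. The byte counts below are illustrative assumptions (actual usage depends on optimizer choice, precision, sharding strategy, and activation memory, which this lower bound ignores):

```python
def training_vram_gb(params_billions, weight_bytes, grad_bytes, opt_bytes):
    """Rough lower bound on training memory: weights + gradients +
    optimizer state. Activations and framework overhead come on top."""
    params = params_billions * 1e9
    return params * (weight_bytes + grad_bytes + opt_bytes) / 1e9

# Illustrative per-parameter byte counts for a 70B model (assumptions):
scenarios = {
    "bf16 weights/grads + fp32 Adam moments (4+4B)": (2, 2, 8),
    "bf16 weights/grads + 8-bit optimizer states":   (2, 2, 2),
    "fp32 weights/grads + fp32 Adam moments":        (4, 4, 8),
}
for name, (w, g, o) in scenarios.items():
    print(f"{name}: ~{training_vram_gb(70, w, g, o):.0f} GB")
```

Depending on the setup, the lower bound lands anywhere from a few hundred GB to over a terabyte, so a ~600GB working figure is well within the plausible range.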
There are millions (billions?) of dollars at stake here, and obviously the best minds are already tackling the problem. Only plebs like us who don't have the skills to do so bicker on an internet forum... It's not like we could realistically spend the time inventing ways to run inference with fewer resources and make significant headway.