Even though our artificial training efficiency is worse now, and is likely to stay worse because we want to trade efficiency for faster training and because we want to cram more knowledge into our training data than a human would ever be exposed to, it still seems likely to me that we'll get within a few orders of magnitude of this sooner or later.
Even if our training efficiency topped out at a hundred times worse than a biological system, that would be the energy equivalent of <100 tons of diesel fuel. Compared to raising and educating a human (and considering that this training can then be used for billions of queries before it becomes obsolete), that strikes me as a very reasonable cost (especially compared to the amounts of energy we wasted on cryptocurrency mining without blinking an eye...).
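As a rough check of that figure, here is a back-of-envelope sketch. None of the inputs come from this thread; they are my assumptions: a brain drawing about 20 W, 20 years of "training", and diesel at roughly 45 MJ/kg.

```python
# Back-of-envelope check; every constant below is an assumption, not a measured value.
BRAIN_POWER_W = 20.0                   # assumed average power draw of a human brain
TRAINING_YEARS = 20.0                  # assumed years spent "training" a human
SECONDS_PER_YEAR = 365.25 * 24 * 3600
DIESEL_J_PER_KG = 45e6                 # assumed energy density of diesel (~45 MJ/kg)
INEFFICIENCY = 100.0                   # "a hundred times worse" than biology


def diesel_tonnes(power_w, years, inefficiency):
    """Energy of a training run `inefficiency` times costlier than biology, in tonnes of diesel."""
    joules = power_w * years * SECONDS_PER_YEAR * inefficiency
    return joules / DIESEL_J_PER_KG / 1000.0


tonnes = diesel_tonnes(BRAIN_POWER_W, TRAINING_YEARS, INEFFICIENCY)
print(f"{tonnes:.0f} tonnes of diesel")
```

Under these assumptions the result is about 28 tonnes, which is consistent with "<100 tons". Budgeting for the whole body (roughly 100 W) rather than the brain alone would multiply it by five, so the conclusion is sensitive to what you count as the biological baseline.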
You can study and reproduce with your own training data, right?
We could pass laws requiring models to disclose their training sets regardless of how the training is distributed; conversely, if this is a community-led project, it also has copyright issues to deal with (Wikipedia, for example).
I suspect there's also a problem in that, e.g. ten million student essays about different pages of Harry Potter can each in isolation be justified by the right to quote small fragments for critical purposes, but the collection together isn't because it quotes an entire book series.
Copyright is intended to reward investment in creative works by giving sole license to distribute. It is not intended to create a monopoly on knowledge about the work.
If I can ask an LLM (or person!) “what’s the first sentence in Harry Potter?” and then “what’s the second sentence?” and so on, that does not mean they are distributing the work in competition with the rights holders.
We have gone way overboard with IP protections. The purpose of copyright is served when Rowling buys her 10th mansion. We do not need to further expand copyright to make it illegal to learn from a work or to remember it after reading.
New Training Technique for Highly Efficient AI Methods (2 points, 5 hours ago) https://news.ycombinator.com/item?id=42690664
DiLoCo: Distributed Low-Communication Training of Language Models (46 points, 1 year ago, 14 comments) https://news.ycombinator.com/item?id=38549337
The second article you linked indicates there will still be intense bandwidth requirements during training, shipping around gradient differentials.
What has changed in the past year? Is this technique looking better, worse, or the same?
Federated learning lowers the barrier to entry and expands the ecosystem, letting more participants share compute and/or datasets so that small players can train models.
DiLoCo, introduced by Douillard, minimizes communication overhead by averaging weight updates. What this article misses, though, is that even so, each GPU in the distributed cluster still needs enough VRAM to load an entire copy of the model to complete the training process. That's where DisTrO comes in: it reduces inter-GPU communication even further, using a decoupling technique (DeMo) that shares only the fast-moving parts of the optimizer across the GPU cluster.
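To make "averaging weight updates" concrete, here is a minimal toy sketch of the idea, not the paper's implementation. As I understand it, DiLoCo uses AdamW for the local steps and a Nesterov-momentum outer optimizer; this sketch swaps in plain gradient descent locally and a plain average of the deltas, on a made-up least-squares problem. All names (`local_steps`, `outer_round`, `shards`) are hypothetical.

```python
import numpy as np


def local_steps(theta, data, steps, lr):
    """Run `steps` of plain gradient descent on a toy least-squares loss."""
    X, y = data
    theta = theta.copy()
    for _ in range(steps):
        grad = 2.0 * X.T @ (X @ theta - y) / len(y)
        theta -= lr * grad
    return theta


def outer_round(theta, shards, steps=20, lr=0.05):
    """One communication round: train locally on each shard, then average the deltas."""
    deltas = [local_steps(theta, shard, steps, lr) - theta for shard in shards]
    # Averaging the deltas is the only step that would need the network.
    return theta + np.mean(deltas, axis=0)


rng = np.random.default_rng(0)
true_theta = np.array([1.0, -2.0, 0.5])
shards = []
for _ in range(4):                      # four hypothetical workers, one data shard each
    X = rng.normal(size=(64, 3))
    shards.append((X, X @ true_theta))  # noise-free labels keep the toy problem simple

theta = np.zeros(3)
for _ in range(10):                     # 10 rounds means only 10 synchronizations
    theta = outer_round(theta, shards)
print(np.round(theta, 3))
```

The point of the design is that workers exchange parameters once every 20 local steps instead of exchanging gradients on every step, which is where the communication saving comes from. Note that every worker here still holds the full `theta`, which is the VRAM limitation mentioned above.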
> And what if the costs could drop further still? The dream for developers pursuing truly decentralised AI is to drop the need for purpose-built training chips entirely. Measured in teraflops, a count of how many operations a chip can do in a second, one of Nvidia’s most capable chips is roughly as powerful as 300 or so top-end iPhones. But there are a lot more iPhones in the world than GPUs. What if they (and other consumer computers) could all be put to work, churning through training runs while their owners sleep?
This aligns with DisTrO because, according to its authors, it could also allow consumer devices like desktop gaming PCs to join the compute cluster and share workloads. There's also an open-source implementation called exo that allows models to be split among idle local devices, but it's limited to inference.
Again, it might still be relevant, since the article mentions that DiLoCo was able to make the model respond better when faced with instruction prompts or reasoning questions never encountered during pre-training. And Arthur seems to think test-time training will make his approach the norm.
sources: DisTrO: https://github.com/NousResearch/DisTrO DeMo: https://arxiv.org/pdf/2411.19870 Exo: https://github.com/exo-explore/exo
That's not exactly accurate. On the data-parallel side, the Distributed Data Parallel (DDP) approach does require a full copy of the model on each GPU. However, there's also Fully Sharded Data Parallel (FSDP), which does not.
Similarly, things like tensor parallelism (TP) split the model over GPUs, to the point where full layers are never on a single GPU anymore.
Combining several of the above is how huge foundation models are trained. Meta used 4D parallelism (FSDP + TP and pipeline/context parallelism) to train Llama 405B.
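A rough sketch of why the DDP versus FSDP distinction matters for VRAM. The 16 bytes per parameter is an assumption (a common accounting for mixed-precision Adam: 2 B weights, 2 B gradients, 12 B of fp32 optimizer state); activations, and the full layers FSDP gathers temporarily during forward and backward passes, are ignored. `per_gpu_gib` is a hypothetical helper, not a library function.

```python
def per_gpu_gib(n_params, n_gpus, sharded, bytes_per_param=16):
    """Approximate GiB of weights, gradients and optimizer state held by one GPU.

    bytes_per_param=16 is an assumed figure for mixed-precision Adam;
    activations and temporary all-gather buffers are not counted.
    """
    total_bytes = n_params * bytes_per_param
    if sharded:
        # FSDP-style: each GPU keeps only a 1/n_gpus slice of the state.
        per_gpu_bytes = total_bytes / n_gpus
    else:
        # DDP-style: every GPU keeps a full replica.
        per_gpu_bytes = total_bytes
    return per_gpu_bytes / 2**30


for name, sharded in [("DDP", False), ("FSDP", True)]:
    print(name, round(per_gpu_gib(7e9, 64, sharded), 1), "GiB per GPU")
```

For a 7B-parameter model on 64 GPUs, this accounting gives on the order of 100 GiB per GPU when replicated and under 2 GiB per GPU when fully sharded, which is why a full copy per device is not a hard requirement.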
I mean it reduces the communication overhead by more orders of magnitude than DiLoCo.
I guess enormous is in the eye of the beholder.
However, in my naïvety, I wonder whether vastly simpler algorithms could be used to end up with similar results. Regular compression techniques work at speeds of up to 700 MB/s.
An LLM trained on addition and multiplication data develops circuits for addition and multiplication[1].
It stands to reason that LLMs trained on human-produced data develop algorithms that try to approximate the data production process (within their computational limits).
> However, in my naïvety, I wonder whether vastly simpler algorithms could be used to end up with similar results.
Almost certainly. Distillation demonstrates this. The difficulty is training. It's harder to train a smaller network and harder to train with less data. But look at humans: they ingest far less data, and certainly less diverse data. We are extremely computationally efficient. I guess you have to be when you run on meat.
True in terms of text, but not if you include video, audio, touch, etc. Sure, one could argue that there is much less information content in video than its raw bytes suggest, but even so, we spend many years building a world model as we play with tools, exist in the world, and go to school. I don't deny humans are more efficient learners, but people tend to forget this. Also, children are taught things in ascending order of difficulty, while with LLMs we just throw random pieces of text at them. There is sure to be a lot of progress in curriculum learning for AI models.
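A minimal sketch of what "ascending order of difficulty" could look like in a data pipeline. Using text length as the difficulty score is purely my assumption for illustration; real curricula would need a better proxy. `curriculum_batches` is a hypothetical helper.

```python
def curriculum_batches(texts, batch_size, difficulty=len):
    """Yield batches ordered from easiest to hardest under `difficulty`.

    `difficulty` defaults to text length, an assumed stand-in for real difficulty.
    """
    ordered = sorted(texts, key=difficulty)
    for i in range(0, len(ordered), batch_size):
        yield ordered[i:i + batch_size]


corpus = ["the cat sat", "a", "gradient descent minimizes a loss", "hi there"]
batches = list(curriculum_batches(corpus, batch_size=2))
for batch in batches:
    print(batch)
```

The usual alternative is to shuffle the corpus and sample uniformly, which is the "random pieces of text" approach described above.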
[0] https://the-decoder.com/openai-co-founder-explains-the-secre...
(edit: I may also not be accounting enough for using a pre-trained general model next to a fine tuned specialized model?)
There was a distributed protein folding project a couple of decades ago.
I remember there were even protein folding apps that could run on game consoles when they weren't being used for games.
But maybe protein folding code is more parallelizable across machines than AI models are.