That said, I am patiently waiting and champing at the bit for the day this isn't true anymore. Cool to see the groundwork being laid for it.
I think LLMs could end up the same way, if the community consolidates around a good one.
If there's a viable way to tune and run models locally, they could still be useful, provided you don't need them to play chess and imitate a Python interpreter at the same time.
Agree that Ghostwriter is subpar though.
There are some advantages to not having to make an LLM that impresses every human being on the planet. Imagine training the AI to be good at only one specific thing. I think it would become much more precise and deterministic.
This is just my hypothesis. I'm excited to see where this goes.
That reminds me - I saw a somewhat-clever acronym variant for LLM that communicated this the other day but it escapes me ATM...
[0] - https://www.mosaicml.com/blog/introducing-pubmed-gpt
[1] - https://cloud.google.com/blog/topics/healthcare-life-science...
[2] - https://dev.to/reaminated/thoughts-on-bloomberggpt-and-domai...
Also, I'm not talking about just a "prompt in, text out" model. Those are great and I'm sure they will be extremely impressive. However, I'm talking more about being able to 'operate' something.
Imagine an AI able to operate some specific API in a deterministic, reliable way. I'm talking about complex operations.
So the output is not so much a text response as an SOP, and then actually operating that SOP.
Imagine going into an app and saying "can you boot up a cluster on AWS, run a WordPress site, and point the domain example.com to it".
Imagine this: "you know my database for app X, what was the latest snapshot?" It replies with the date/time of the snapshot, and you say "can you move that snapshot from Google Cloud and create a new database from it on AWS?", and it does it for you.
That's what I look forward to.
But I'm having trouble finding resources about how to achieve that.
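One common way people sketch this today is to have the model emit a structured plan (the "SOP") as JSON rather than free text, and then have a dispatcher execute it against a whitelist of operations. Everything below is hypothetical: the operation names, the stub implementations, and the example plan are illustrative, and a real system would wrap actual cloud SDK calls and validate the plan before running it.

```python
import json

# Hypothetical registry of operations the assistant is allowed to perform.
# Real implementations would wrap cloud SDK calls (boto3, google-cloud, etc.).
OPERATIONS = {}

def operation(name):
    def register(fn):
        OPERATIONS[name] = fn
        return fn
    return register

@operation("get_latest_snapshot")
def get_latest_snapshot(database):
    # Stub: a real version would query the cloud provider's API.
    return {"database": database, "snapshot_id": "snap-001",
            "taken_at": "2023-04-01T02:00:00Z"}

@operation("restore_snapshot")
def restore_snapshot(snapshot_id, target_cloud):
    # Stub: a real version would kick off and monitor a restore job.
    return {"status": "restored", "snapshot_id": snapshot_id,
            "cloud": target_cloud}

def run_sop(sop_json):
    """Execute an SOP: an ordered list of {"op": ..., "args": {...}} steps.

    The model emits this JSON plan instead of free text; only whitelisted
    operations ever run, which is what makes the behavior deterministic.
    """
    results = []
    for step in json.loads(sop_json):
        fn = OPERATIONS[step["op"]]  # KeyError -> refuse unknown operations
        results.append(fn(**step["args"]))
    return results

# Example plan, as a model might emit for the snapshot-migration request.
plan = json.dumps([
    {"op": "get_latest_snapshot", "args": {"database": "app-x"}},
    {"op": "restore_snapshot",
     "args": {"snapshot_id": "snap-001", "target_cloud": "aws"}},
])
print(run_sop(plan))
```

The nice property of this shape is that the LLM only has to get the plan right; the execution itself is plain, auditable code.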
Training these models from scratch on your domain-specific data is not as expensive as one might think. We have provided some cost estimates in our blog posts.
https://www.mosaicml.com/blog/mosaicbert
https://www.mosaicml.com/blog/training-stable-diffusion-from...
As you note, with the plethora of open/open-ish LLMs today, plus LoRA and PEFT, you can fine-tune with low VRAM and pretty quickly, so even a single A100 (or whatever cloud GPU you can get) is just fine. I've even seen people pull it off in reasonable time on super cheap T4s, A10s, etc.
I doubt anyone reading a blog post is attempting to train a "true" multi-billion param LLM from scratch.
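The reason LoRA fits in low VRAM comes down to arithmetic: instead of training a full d x k weight matrix, you train two small matrices B (d x r) and A (r x k) at a low rank r, so only r*(d+k) parameters need gradients and optimizer state. A back-of-the-envelope sketch (the dimensions below are illustrative, roughly matching an attention projection in a mid-size model):

```python
def full_params(d, k):
    """Trainable parameters if you fine-tune the full d x k weight."""
    return d * k

def lora_trainable_params(d, k, r):
    """Trainable parameters for one LoRA-adapted d x k weight at rank r:
    B is d x r and A is r x k, so r*(d+k) total."""
    return r * (d + k)

d = k = 4096  # illustrative projection size
r = 8         # a typical LoRA rank

full = full_params(d, k)                # 16,777,216
lora = lora_trainable_params(d, k, r)   # 65,536
print(f"full: {full:,}  lora: {lora:,}  ratio: {full // lora}x fewer")
```

That 256x reduction per adapted matrix is why the optimizer state (the usual VRAM hog with Adam) shrinks enough to fit on a single commodity GPU.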
Most interesting is what happens between the preprocessing and the model training - the hand-off to the cluster workers.
I guess the efficient option is to partition the data, set up shards in advance, and ideally cache or even copy the data to local workers during init.
This, of course, breaks some of the promise of being able to scale training flexibly, for instance to experiment with the scaling of compute and data.
A different way to go about it is to use a streaming/iterable dataset/loader implementation with its own sharding logic that reads from a central store of Parquet files with some reasonable row-group size. This gives full flexibility in terms of node/gpu/worker/batch_size for experimentation - e.g. literally as parameters in PyTorch. Of course, one also has to implement caching of remote data, since the data is kept centrally.
In my opinion, there is no satisfying/flexible solution for this yet, especially when one also wants to experiment with complex transformations or augmentations in the dataset/loader and remain portable across cloud offerings. So this has to be implemented from scratch (not too difficult, but still a lot of code). The upcoming datapipes will probably make this trivial eventually.
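The core of that hand-rolled sharding logic is small. A minimal stdlib sketch of the idea, with hypothetical names: flatten (node, gpu, dataloader-worker) into one global rank, then stride over the central shard list so every worker gets a disjoint slice. In real PyTorch you'd pull the node/GPU ranks from torch.distributed and the worker id from torch.utils.data.get_worker_info() instead of passing them by hand.

```python
def global_rank(node_rank, gpu_rank, worker_id, gpus_per_node, workers_per_gpu):
    """Flatten (node, gpu, dataloader-worker) into one global worker index."""
    return (node_rank * gpus_per_node + gpu_rank) * workers_per_gpu + worker_id

def assign_shards(files, rank, world_size):
    """Round-robin: worker i takes files i, i + world_size, i + 2*world_size, ...
    Slices are disjoint and together cover every file exactly once."""
    return files[rank::world_size]

# Hypothetical central store of Parquet shards.
files = [f"part-{i:04d}.parquet" for i in range(10)]
world = 2 * 2 * 2  # 2 nodes x 2 GPUs x 2 loader workers = 8 global workers

rank = global_rank(node_rank=0, gpu_rank=1, worker_id=0,
                   gpus_per_node=2, workers_per_gpu=2)
print(rank, assign_shards(files, rank, world))
```

Because the assignment is a pure function of the ranks and the file list, changing the node/GPU/worker counts between runs just re-partitions the same central store, which is exactly the flexibility argued for above.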
Would love to hear more experiences in how you set this up!
Edit: I guess for NLP this is a good implementation and what Mosaic uses https://huggingface.co/docs/datasets/stream
In principle, that's great. But the reality is: whoever has the resources and benefits from something better will look for ways to get it. What they're communicating here is: the most resourceful developers on the planet aren't our ideal customer.
Reading through the entire story leaves me with a bad taste in my mouth, especially this bit: "still refused to list any specific part of Replit he thought I had copied, even when I asked him for such details multiple times during the phone call, despite his continuing to claim both privately and publicly that I copied Replit unethically".
I haven't used Replit, but reading about it and looking at riju.codes, I have a hard time believing that there was any secret sauce that was inappropriately used, and the sketchy refusal to give details makes me think it's more about a CEO establishing dominance over the little people than any serious IP concern.
I’m curious what he’s said or done to make you a fan?
I just want a URL in which I can run some code. https://riju.codes/ is literally that. Thanks!
Why?