Big labs like OpenAI and DeepMind have big clusters that support this kind of bursty allocation for their researchers, but startups have so far had to buy very small clusters on very long-term contracts, wait through months of lead time, and then try to keep them busy all the time.
Our goal is to make it about 10-20x cheaper to do an AI startup than it is right now. Stable Diffusion only costs about $100k to train -- in theory every YC company could get up to that scale. It's just that no cloud provider in the world will give you $100k of compute for just a couple weeks, so startups have to raise 20x that much to buy a whole year of compute.
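To put rough numbers on that 20x (a back-of-envelope with my own assumptions, not actual cloud pricing):

```python
# Back-of-envelope on the burst-vs-reserve gap. All numbers are
# illustrative assumptions, not actual SF Compute or cloud pricing.
burst_budget = 100_000    # a Stable-Diffusion-scale training run
burst_weeks = 2           # how long you actually need the capacity
contract_weeks = 52       # the minimum term providers will sell you

# Same weekly capacity, bought for a full year instead of two weeks:
capital_needed = burst_budget * contract_weeks / burst_weeks
print(f"capital to raise: ${capital_needed:,.0f}")   # $2,600,000
print(f"multiple over the run itself: {contract_weeks / burst_weeks:.0f}x")
```

Reserved-rate discounts pull that 26x down toward the 20x figure, but the shape of the problem is the same.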
Once the cluster is online, we're going to be pretty much the only place startups can do big training runs like that.
In 2023 you can barely get a single TPU for more than an hour. Back then you could get literally hundreds, with an s.
I believed in TRC. I thought they’d solve it by scaling, and building a whole continent of TPUs. But in the end, TPU time was cut short in favor of internal researchers — some researchers being more equal than others. And how could it be any other way? If I made a proposal today to get these H100s to train GPT to play chess, people would laugh. The world is different now.
Your project has a youthful optimism that I hope you won’t lose as you go. And in fact it might be the way to win in the long run. So whenever someone comes knocking, begging for a tiny slice of your H100s for their harebrained idea, I hope you’ll humor them. It’s the only reason I was able to become anybody.
This is the nicest thing anyone has said to us about this. We're gonna frame this and hang it on our wall.
> So whenever someone comes knocking, begging for a tiny slice of your H100s for their harebrained idea, I hope you’ll humor them.
Absolutely! :D
I'm not encouraging the false belief that everything you do will work out. Instead I'm encouraging the realization that the greatest accomplishments almost always feel like long shots, and require significant amounts of optimism. Fear and pessimism, while helpful in appropriate doses, will limit you greatly in life if you let them rule you too significantly.
When I look back on my life, the greatest accomplishments I've achieved are ones where I was naive yet optimistic going into it. This was a good thing, because I would have been too scared to try had I really known the challenges that lay ahead.
Check out this list of recent TRC-supported publications: https://sites.research.google/trc/publications/
Demand for Cloud TPUs is definitely intense, so if you're using preemptible capacity, you're probably seeing more frequent interruptions, but reserved capacity is also available. Hope you email the TRC support team to say hello!
This may feel like an anime betrayal, since you basically launched my career as a scientist. But it’s important for hobbyists and tinkerers to be able to participate in the AI ecosystem, especially today. And TRC just does not support them anymore. I tried, many times, over the last year and a half.
You don’t need to take my word for it. Here’s some unfiltered DMs on the subject: https://imgur.com/a/6vqvzXs
Notice how their optimism dries up, and not because I was telling them how bad TRC has become. It’s because their TPUs kept dying.
I held out hope for so long. I thought it was temporary. It ain't temporary, Zak. And I vividly remember when it happened. Some smart person at Google proposed a new allocation algorithm back near the end of 2021, and poof, overnight our ability to create TPUs went from dozens to a handful. It was quite literally overnight; we had monitoring graphs that flatlined. I can probably still dig them up.
I’ve wanted to email you privately about this, but given that I am a small fish in a pond that’s grown exponentially bigger, I don’t think it would’ve made a difference. The difference is in your last paragraph: you allocate reserved instances to those who deserve it, and leave everybody else to fight over 45 minutes of TPU time when it takes 25 minutes just to create and fill your TPU with your research data.
Your non-preemptible TPUs are frankly a lie. I didn’t want to drop the L word, but a TPUv3 in euw4a will literally delete itself — aka preempt — after no more than a couple hours. I tested this over many months. That was some time ago, so maybe things have changed, but I wouldn’t bet on it.
There’s some serious “left hand doesn’t know that right hand detached from its body and migrated south for the winter” energy in the TRC program. I don’t know where it embedded itself, but if you want to elevate any other engineers from software devs to researchers, I urge you to make some big changes.
One last thing. The support staff of TRC is phenomenal. Jonathan Colton has worked more miracles than I can count, along with the rest of his crew. Ultimately he had to send me an email like “by the way, TRC doesn’t delete TPUs. This distinction probably won’t be too relevant, but I wanted to let you know” (paraphrasing). Translation: you took the power away from the people who knew where to put it (Jonathan) and gave it to some really important researchers, probably in Brain or some other division of Google. And the rest is history. So I don’t want to hear that one of the changes is “ok, we’ve punished the support staff” - as far as I can tell, they’ve moved mountains with whatever tools they had available, and I definitely wouldn’t have been able to do any better in their shoes.
Also, hello. Thanks for launching my career. Sorry that I had to leave this here, but my duty is to the open source community. The good news is that you can still recover, if only you’d revert this silly “we’ll slip you some reserved TPUs that don’t kamikaze themselves after 45 minutes if you ask in just the right way” stuff. That wasn’t how the program was in 2019, and I guarantee that I couldn’t have done the work I did then under the current conditions.
The Googlers maintaining the TPU GitHub repo also just basically don't care about your PR unless it's somehow gonna help them in their own perf review.
In contrast, with a GPU-based grid you can not only run the latest & greatest out of the box but also do a lot of local testing, which saves tons of time.
Finally, the OP here appears to be offering real customer engagement, which is totally absent from my own GCloud experiences across several companies.
My understanding was that this situation changed drastically depending on what sort of email you had or how popular your Twitter handle was.
Are you affiliated with an academic institution? Otherwise I'm not sure why they've been more generous with me; my projects have been mildly interesting at best.
They're certainly a lot stingier with larger pods than they used to be though.
Oh come on, Colab gives TPU access in the free tier for a whole half day. No need to exaggerate the shortage.
Um. Can't you order them from coral.ai and put them in an NVMe slot? Or are the cloud TPUs more powerful?
In theory, this sounds almost identical to the business model behind AWS, Azure, and other cloud providers. "Instead of everyone buying a fixed amount of hardware for individual use, we'll buy a massive pool of hardware that people can time-share." Outside of cloud providers having to mark up prices to give themselves a net-margin, is there something else they are failing to do, hence creating the need for these projects?
1) Margins. Public cloud investors expect a certain margin profile. They can’t compete with Lambda/Fluidstack’s margins.
2) To an extent, the big clouds also have worse networking for LLM training. I believe only Azure has InfiniBand. Oracle is 3200 Gbps but not InfiniBand, and I believe the same goes for AWS. Not sure about GCP, but their A100 networking was only 100 Gbps, rather than 1600. Whereas Lambda, FluidStack, and CoreWeave all have IB. (A back-of-envelope sketch of why this matters follows this list.)
3) Availability. Nvidia isn’t giving big clouds the allocation they want.
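For a sense of why those link speeds matter, here's a rough sketch of ring all-reduce time for one gradient sync. The model size, node count, and bandwidth figures are all assumptions of mine, and it ignores NCCL pipelining and compute/communication overlap:

```python
# Back-of-envelope: time to all-reduce one step's gradients over a ring
# at different per-node link speeds. All numbers are illustrative
# assumptions, not measured figures.

def ring_allreduce_seconds(param_count, bytes_per_param, nodes, gbps):
    """In a ring all-reduce, each node sends and receives roughly
    2*(n-1)/n of the gradient buffer."""
    grad_bytes = param_count * bytes_per_param
    wire_bytes = 2 * (nodes - 1) / nodes * grad_bytes
    return wire_bytes / (gbps / 8 * 1e9)  # Gbps -> bytes/sec

# A 7B-parameter model with fp16 gradients across 16 nodes:
for gbps in (100, 1600, 3200):
    t = ring_allreduce_seconds(7e9, 2, nodes=16, gbps=gbps)
    print(f"{gbps:>4} Gbps link: ~{t:.2f} s per gradient sync")
```

At 100 Gbps the sync takes about 2 seconds per step versus about 0.13 s at 1600 Gbps, which can be the difference between network-bound and compute-bound training.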
Secondly, there is a fundamental question of resource sharing here. Even with this project by Evan and AI Grant (the second such cluster created by AI Grant btw), the question will arise — if one team has enough money to provision the entire cluster forever, why not do it? What are the exact parameters of fair use? In networking, we have algorithms around bandwidth sharing (TCP Fairness, etc.) that encode sharing mechanisms but they don’t work for these kinds of chunky workloads either.
But over the next few months, AWS and others are working to release queueing services that let you temporarily provision a chunk of compute, probably with upfront payment and at a high expense (perhaps above the on-demand rate).
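On the point above that TCP-style fairness doesn't transfer: here's a toy sketch, entirely my own illustration, of classic max-min fair sharing and why a fractional share is worthless to an all-or-nothing training job.

```python
# Toy illustration (my own sketch, not anything SF Compute has described):
# max-min fairness divides capacity smoothly among divisible demands,
# but an all-or-nothing training job can't use a fractional share.

def max_min_fair(capacity, demands):
    """Classic max-min fair allocation for divisible demands."""
    alloc = {}
    remaining = dict(demands)
    cap = capacity
    while remaining:
        share = cap / len(remaining)
        # Fully satisfy everyone demanding no more than the equal share.
        small = {k: d for k, d in remaining.items() if d <= share}
        if not small:
            for k in remaining:
                alloc[k] = share
            return alloc
        for k, d in small.items():
            alloc[k] = d
            cap -= d
            del remaining[k]
    return alloc

demands = {"team_a": 512, "team_b": 64, "team_c": 512}  # GPUs wanted
print(max_min_fair(512, demands))
# -> {'team_b': 64, 'team_a': 224.0, 'team_c': 224.0}
```

A 224-GPU slice is fair by the networking definition, but useless to a job that needs all 512 at once, which is exactly the chunkiness problem.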
I would argue this has always been a common case for cloud GPU compute.
They want to do that themselves, and keep the customer relationship and the profits, instead of giving them to a middleman or the customer.
You can rent a 2-socket AMD server with 120 available cores and RDMA for something like 50c to $2 per hour. That’s just barely above the cost of the electricity and cooling!
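As a rough sanity check on where that floor sits (the wattage, PUE, and electricity rate here are all my own guesses, not the parent's figures):

```python
# Rough sanity check on the power-and-cooling floor for a CPU server.
server_watts = 700        # assumed draw for a loaded 2-socket EPYC box
pue = 1.4                 # assumed datacenter overhead (cooling etc.)
dollars_per_kwh = 0.10    # assumed industrial electricity rate

facility_kw = server_watts * pue / 1000
cost_per_hour = facility_kw * dollars_per_kwh
print(f"~${cost_per_hour:.2f}/hr in power and cooling")  # ~$0.10/hr
```

Under those assumptions the floor is around $0.10/hour, so the low end of that rental range really is within a small multiple of raw power and cooling.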
What do you want, free compute just handed to you out of the goodness of their hearts?
There is incredible demand for high-end GPUs right now, and market prices reflect that.
A large part of the profit comes from taking on the upfront risk of buying machines. With this, you're just absorbing that risk yourself, which may be better if the startup expects to last.
I've never had to buy very large compute, but I thought that was the whole point of the cloud
Right now, it's pretty easy to get a few A/H100s (Lambda is great for this), but very hard to get more than 24 at a reasonable price (~$2 an hour). One often needs to put up a 6+ month commitment, even when they may only need the H100s for an 8-hour training run.
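The mismatch is stark if you multiply it out. A quick sketch using the ~$2/GPU-hour figure above and a hypothetical 24-GPU, 8-hour run:

```python
# Illustrative commitment math, using assumed figures from above.
gpus, rate = 24, 2.00          # GPUs and $/GPU-hour
run_hours = 8                  # what you actually need
commit_hours = 6 * 30 * 24     # a 6-month minimum reservation

burst_cost = gpus * rate * run_hours
commit_cost = gpus * rate * commit_hours
print(f"8-hour run:     ${burst_cost:,.0f}")       # $384
print(f"6-month commit: ${commit_cost:,.0f}")      # $207,360
print(f"overpay factor: {commit_cost / burst_cost:.0f}x")
```

That's paying six figures up front to run a few-hundred-dollar job.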
It's the right business decision for GPU brokers to do long term reservations and so on, and we might do so too if we were in their shoes. But we're not in their shoes and have a very different goal: arm the rebels! Let someone who isn't BigCorp train a model!
As a graduate student, thank you. Thankfully, my workloads aren't LLM crazy so I can get by on my old NVIDIA consumer hardware, but I have coworkers struggling to get reasonable prices/time for larger scale hardware.
No idea what capacity at Lambda Labs actually looks like, though. Does anyone have insight into how easy it is to spin up more than 2-3 instances there?
I have never seen a GPU crunch quite like the one right now. To anyone who is interested in hobbyist ML, I highly, highly recommend using vast.ai
For H100s and A100s: Lambda, FluidStack, RunPod. Also CoreWeave, Crusoe, Oblivus, and Latitude.
For non-A/H100s: Vast, TensorDock, and RunPod here too.
Also, many of the available options are clearly recycled crypto mining rigs, which have somewhat odd configurations (poor GPU bandwidth, low CPU RAM).
But for AI training? If the public cloud isn't competitive even for bursty AI training, their margins are much higher than I anticipated.
OP mentions 10-20x cost reduction? Compared to what? AWS?
[1] We have not gone the way of the Xerces blue [2] yet... we still exist!
But I do think a lot of our customers will be out here; SF is still probably the best place to do startups. We just have so many more people doing hard technical stuff here. Literally every single place I've lived in SF, there's been another startup living upstairs or downstairs.
Good idea to host some in person events!
now that's a hot take if I ever saw one
Make money off your GPU with vast.AI
> Ubuntu 18.04 or newer (required)
> Dedicated machines only - the machine shouldn't be doing other stuff while rented
well that's certainly not what I expected. ctrl-f "virtual" gives nothing, so it seems they really mean "take over your machine"
> Note: you may need to install python2.7 to run the install script.
what kind of nonsense is this? Did they write the script in 2001 and just abandon it?
Basically twitter devolves into the Colos of the late 90s :-)
For those who didn't notice, it was tongue-in-cheek.
Neither Alex nor I are currently VCs, and this has no affiliation with any venture fund.
We want to be a customer of the SF Compute Group too!
Perhaps it's also a way for freshly applying grad students to identify a university looking to do research in LLMs that requires scale...
554 5.7.1 <alex@sfcompute.org>: Relay access denied
Price and market depth are very different things
When was the last time you spoke to a chatbot?
Where will the cluster be hosted?
May I suggest that you get your IP transit from he.net?
Who is funding this?
Cause if it’s VC then it’s going to have the same fate as everything else after 5-7 years.
I hope y’all have as innovative of a business model. You’ll need it if you want to do what you’re doing now for more than a few years
Not everything has to grow to have the appetite of Galactus and swallow a whole planet. Making single-digit millions of dollars over a couple of years is still worthwhile, especially if it helps others and moves humanity forwards.
This project isn't ever going to want to try and compete with AWS, so no, it's not a billion-dollar question. $20 million, yeah.
That’s why I’m asking because a “bootstrapped” company like you describe has a future…
One backed by VC doesn’t
I mean they may have a future but not like you describe
Is it accurate to say you're willing to go into ~20,000,000 USD debt to sell discounted compute-as-a-service to researchers/startups, but unwilling to go into debt to sponsor the undergraduate degrees of ~100-500 students at top-tier schools? (40k - 200k USD per degree)
Or, you know, build and fund a small public school/library or two for ~5 years?