Big labs like OpenAI and DeepMind have big clusters that support this kind of bursty allocation for their researchers, but startups have so far had to buy very small clusters on very long-term contracts, wait through months of lead time, and then try to keep them busy all the time.
Our goal is to make it about 10-20x cheaper to do an AI startup than it is right now. Stable Diffusion only costs about $100k to train -- in theory every YC company could get up to that scale. It's just that no cloud provider in the world will give you $100k of compute for just a couple weeks, so startups have to raise 20x that much to buy a whole year of compute.
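To put rough numbers on that 20x (a back-of-envelope with my own assumptions, not actual cloud pricing):

```python
# Back-of-envelope on the burst-vs-reserve gap. All numbers are
# illustrative assumptions, not actual SF Compute or cloud pricing.
burst_budget = 100_000    # a Stable-Diffusion-scale training run
burst_weeks = 2           # how long you actually need the capacity
contract_weeks = 52       # the minimum term providers will sell you

# Same weekly capacity, bought for a full year instead of two weeks:
capital_needed = burst_budget * contract_weeks / burst_weeks
print(f"capital to raise: ${capital_needed:,.0f}")   # $2,600,000
print(f"multiple over the run itself: {contract_weeks / burst_weeks:.0f}x")
```

Reserved-rate discounts pull that 26x down toward the 20x figure, but the shape of the problem is the same.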
Once the cluster is online, we're going to be pretty much the only place startups can do big training runs like that.
In 2023 you can barely get a single TPU for more than an hour. Back then you could get literally hundreds, with an s.
I believed in TRC. I thought they’d solve it by scaling, and building a whole continent of TPUs. But in the end, TPU time was cut short in favor of internal researchers — some researchers being more equal than others. And how could it be any other way? If I made a proposal today to get these H100s to train GPT to play chess, people would laugh. The world is different now.
Your project has a youthful optimism that I hope you won’t lose as you go. And in fact it might be the way to win in the long run. So whenever someone comes knocking, begging for a tiny slice of your H100s for their harebrained idea, I hope you’ll humor them. It’s the only reason I was able to become anybody.
This is the nicest thing anyone has said to us about this. We're gonna frame this and hang it on our wall.
> So whenever someone comes knocking, begging for a tiny slice of your H100s for their harebrained idea, I hope you’ll humor them.
Absolutely! :D
I'm not encouraging the false belief that everything you do will work out. Instead I'm encouraging the realization that the greatest accomplishments almost always feel like long shots, and require significant amounts of optimism. Fear and pessimism, while helpful in appropriate doses, will limit you greatly in life if you let them rule you too significantly.
When I look back on my life, the greatest accomplishments I've achieved are ones where I was naive yet optimistic going into it. This was a good thing, because I would have been too scared to try had I really known the challenges that lay ahead.
Check out this list of recent TRC-supported publications: https://sites.research.google/trc/publications/
Demand for Cloud TPUs is definitely intense, so if you're using preemptible capacity, you're probably seeing more frequent interruptions, but reserved capacity is also available. Hope you email the TRC support team to say hello!
This may feel like an anime betrayal, since you basically launched my career as a scientist. But it’s important for hobbyists and tinkerers to be able to participate in the AI ecosystem, especially today. And TRC just does not support them anymore. I tried, many times, over the last year and a half.
You don’t need to take my word for it. Here’s some unfiltered DMs on the subject: https://imgur.com/a/6vqvzXs
Notice how their optimism dries up, and not because I was telling them how bad TRC has become. It’s because their TPUs kept dying.
I held out hope for so long. I thought it was temporary. It ain't temporary, Zak. And I vividly remember when it happened. Some smart person at Google proposed a new allocation algorithm back near the end of 2021, and poof, overnight our ability to create TPUs went from dozens to a handful. It was quite literally overnight; we had monitoring graphs that flatlined. I can probably still dig them up.
I’ve wanted to email you privately about this, but given that I am a small fish in a pond that’s grown exponentially bigger, I don’t think it would’ve made a difference. The difference is in your last paragraph: you allocate reserved instances to those who deserve it, and leave everybody else to fight over 45 minutes of TPU time when it takes 25 minutes just to create and fill your TPU with your research data.
Your non-preemptible TPUs are frankly a lie. I didn’t want to drop the L word, but a TPUv3 in euw4a will literally delete itself — aka preempt — after no more than a couple hours. I tested this over many months. That was some time ago, so maybe things have changed, but I wouldn’t bet on it.
There’s some serious “left hand doesn’t know that right hand detached from its body and migrated south for the winter” energy in the TRC program. I don’t know where it embedded itself, but if you want to elevate any other engineers from software devs to researchers, I urge you to make some big changes.
One last thing. The support staff of TRC is phenomenal. Jonathan Colton has worked more miracles than I can count, along with the rest of his crew. Ultimately he had to send me an email like “by the way, TRC doesn’t delete TPUs. This distinction probably won’t be too relevant, but I wanted to let you know” (paraphrasing). Translation: you took the power away from the people who knew where to put it (Jonathan) and gave it to some really important researchers, probably in Brain or some other division of Google. And the rest is history. So I don’t want to hear that one of the changes is “ok, we’ve punished the support staff” - as far as I can tell, they’ve moved mountains with whatever tools they had available, and I definitely wouldn’t have been able to do any better in their shoes.
Also, hello. Thanks for launching my career. Sorry that I had to leave this here, but my duty is to the open source community. The good news is that you can still recover, if only you’d revert this silly “we’ll slip you some reserved TPUs that don’t kamikaze themselves after 45 minutes if you ask in just the right way” stuff. That wasn’t how the program was in 2019, and I guarantee that I couldn’t have done the work I did then under the current conditions.
The Googlers maintaining the TPU GitHub repo also just basically don't care about your PR unless it's somehow gonna help them in their own perf review.
In contrast, with a GPU-based grid you can not only run the latest & greatest out of the box but also do a lot of local testing, which saves tons of time.
Finally, the OP here appears to be offering real customer engagement, which is totally absent from my own GCloud experiences across several companies.
My understanding was that this situation changed drastically depending on what sort of email you had or how popular your Twitter handle was.
Are you affiliated with an academic institution? Otherwise I'm not sure why they've been more generous with me; my projects have been mildly interesting at best.
They're certainly a lot stingier with larger pods than they used to be though.
Oh come on, Colab gives TPU access in the free tier for a whole half day. No need to exaggerate the shortage.
Um. Can't you order them from coral.ai and put them in an NVMe slot? Or are the cloud TPUs more powerful?
In theory, this sounds almost identical to the business model behind AWS, Azure, and other cloud providers. "Instead of everyone buying a fixed amount of hardware for individual use, we'll buy a massive pool of hardware that people can time-share." Outside of cloud providers having to mark up prices to give themselves a net-margin, is there something else they are failing to do, hence creating the need for these projects?
1) Margins. Public cloud investors expect a certain margin profile. They can’t compete with Lambda/Fluidstack’s margins.
2) To an extent, the big clouds also have worse networking for LLM training. I believe only Azure has InfiniBand. Oracle is 3200 Gbps but not InfiniBand, and I believe the same goes for AWS. Not sure about GCP, but their A100 networking was only 100 Gbps, rather than 1600. Whereas Lambda, FluidStack, and CoreWeave all have IB. (A back-of-envelope sketch of why this matters follows this list.)
3) Availability. Nvidia isn’t giving big clouds the allocation they want.
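For a sense of why those link speeds matter, here's a rough sketch of ring all-reduce time for one gradient sync. The model size, node count, and bandwidth figures are all assumptions of mine, and it ignores NCCL pipelining and compute/communication overlap:

```python
# Back-of-envelope: time to all-reduce one step's gradients over a ring
# at different per-node link speeds. All numbers are illustrative
# assumptions, not measured figures.

def ring_allreduce_seconds(param_count, bytes_per_param, nodes, gbps):
    """In a ring all-reduce, each node sends and receives roughly
    2*(n-1)/n of the gradient buffer."""
    grad_bytes = param_count * bytes_per_param
    wire_bytes = 2 * (nodes - 1) / nodes * grad_bytes
    return wire_bytes / (gbps / 8 * 1e9)  # Gbps -> bytes/sec

# A 7B-parameter model with fp16 gradients across 16 nodes:
for gbps in (100, 1600, 3200):
    t = ring_allreduce_seconds(7e9, 2, nodes=16, gbps=gbps)
    print(f"{gbps:>4} Gbps link: ~{t:.2f} s per gradient sync")
```

At 100 Gbps the sync takes about 2 seconds per step versus about 0.13 s at 1600 Gbps, which can be the difference between network-bound and compute-bound training.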
Secondly, there is a fundamental question of resource sharing here. Even with this project by Evan and AI Grant (the second such cluster created by AI Grant btw), the question will arise — if one team has enough money to provision the entire cluster forever, why not do it? What are the exact parameters of fair use? In networking, we have algorithms around bandwidth sharing (TCP Fairness, etc.) that encode sharing mechanisms but they don’t work for these kinds of chunky workloads either.
But over the next few months, AWS and others are working to release queueing services that let you temporarily provision a chunk of compute, probably with upfront payment and at a high expense (perhaps above the on-demand rate).
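On the point above that TCP-style fairness doesn't transfer: here's a toy sketch, entirely my own illustration, of classic max-min fair sharing and why a fractional share is worthless to an all-or-nothing training job.

```python
# Toy illustration (my own sketch, not anything SF Compute has described):
# max-min fairness divides capacity smoothly among divisible demands,
# but an all-or-nothing training job can't use a fractional share.

def max_min_fair(capacity, demands):
    """Classic max-min fair allocation for divisible demands."""
    alloc = {}
    remaining = dict(demands)
    cap = capacity
    while remaining:
        share = cap / len(remaining)
        # Fully satisfy everyone demanding no more than the equal share.
        small = {k: d for k, d in remaining.items() if d <= share}
        if not small:
            for k in remaining:
                alloc[k] = share
            return alloc
        for k, d in small.items():
            alloc[k] = d
            cap -= d
            del remaining[k]
    return alloc

demands = {"team_a": 512, "team_b": 64, "team_c": 512}  # GPUs wanted
print(max_min_fair(512, demands))
# -> {'team_b': 64, 'team_a': 224.0, 'team_c': 224.0}
```

A 224-GPU slice is fair by the networking definition, but useless to a job that needs all 512 at once, which is exactly the chunkiness problem.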
I would argue this has always been a common case for cloud GPU compute.
They want to do that themselves, and keep the customer relationship and the profits, instead of giving them to a middleman or the customer.
You can rent a 2-socket AMD server with 120 available cores and RDMA for something like 50c to $2 per hour. That’s just barely above the cost of the electricity and cooling!
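As a rough sanity check on where that floor sits (the wattage, PUE, and electricity rate here are all my own guesses, not the parent's figures):

```python
# Rough sanity check on the power-and-cooling floor for a CPU server.
server_watts = 700        # assumed draw for a loaded 2-socket EPYC box
pue = 1.4                 # assumed datacenter overhead (cooling etc.)
dollars_per_kwh = 0.10    # assumed industrial electricity rate

facility_kw = server_watts * pue / 1000
cost_per_hour = facility_kw * dollars_per_kwh
print(f"~${cost_per_hour:.2f}/hr in power and cooling")  # ~$0.10/hr
```

Under those assumptions the floor is around $0.10/hour, so the low end of that rental range really is within a small multiple of raw power and cooling.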
What do you want, free compute just handed to you out of the goodness of their hearts?
There is incredible demand for high-end GPUs right now, and market prices reflect that.
A large part of the profit comes from taking on the upfront risk of buying machines. With this, you're just absorbing that risk yourself, which may be better if the startup expects to last.
I've never had to buy very large compute, but I thought that was the whole point of the cloud
Right now, it's pretty easy to get a few A/H100s (Lambda is great for this), but very hard to get more than 24 at a reasonable price (~$2 an hour). One often needs to put up a 6+ month commitment, even when they may only need the H100s for an 8-hour training run.
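The mismatch is stark if you multiply it out. A quick sketch using the ~$2/GPU-hour figure above and a hypothetical 24-GPU, 8-hour run:

```python
# Illustrative commitment math, using assumed figures from above.
gpus, rate = 24, 2.00          # GPUs and $/GPU-hour
run_hours = 8                  # what you actually need
commit_hours = 6 * 30 * 24     # a 6-month minimum reservation

burst_cost = gpus * rate * run_hours
commit_cost = gpus * rate * commit_hours
print(f"8-hour run:     ${burst_cost:,.0f}")       # $384
print(f"6-month commit: ${commit_cost:,.0f}")      # $207,360
print(f"overpay factor: {commit_cost / burst_cost:.0f}x")
```

That's paying six figures up front to run a few-hundred-dollar job.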
It's the right business decision for GPU brokers to do long term reservations and so on, and we might do so too if we were in their shoes. But we're not in their shoes and have a very different goal: arm the rebels! Let someone who isn't BigCorp train a model!
As a graduate student, thank you. Thankfully, my workloads aren't LLM crazy so I can get by on my old NVIDIA consumer hardware, but I have coworkers struggling to get reasonable prices/time for larger scale hardware.
No idea what capacity at Lambda Labs actually looks like, though. Does anyone have insight into how easy it is to spin up more than 2-3 instances there?
I have never seen a GPU crunch quite like the one right now. To anyone who is interested in hobbyist ML, I highly, highly recommend using vast.ai
For H100s and A100s: Lambda, FluidStack, RunPod. Also CoreWeave, Crusoe, Oblivus, and Latitude.
For non-A/H100s: Vast, TensorDock, and RunPod here too.
Also, many of the available options are clearly recycled crypto mining rigs, which have somewhat odd configurations (poor GPU bandwidth, low CPU RAM).
But for AI training? If the public cloud isn't competitive even for bursty AI training, their margins are much higher than I anticipated.
OP mentions 10-20x cost reduction? Compared to what? AWS?
[1] We have not gone the way of the Xerces blue [2] yet... we still exist!
But I do think a lot of our customers will be out here; SF is still probably the best place to do startups. We just have so many more people doing hard technical stuff here. Literally every single place I've lived in SF, there's been another startup living upstairs or downstairs.
Good idea to host some in person events!
now that's a hot take if I ever saw one
Make money off your GPU with vast.AI
> Ubuntu 18.04 or newer (required)
> Dedicated machines only - the machine shouldn't be doing other stuff while rented
well that's certainly not what I expected. ctrl-f "virtual" gives nothing, so it seems they really mean "take over your machine"
> Note: you may need to install python2.7 to run the install script.
what kind of nonsense is this? Did they write the script in 2001 and just abandon it?
Basically twitter devolves into the Colos of the late 90s :-)
For those who didn't notice, it was tongue-in-cheek.
Neither Alex nor I are currently VCs, and this has no affiliation with any venture fund.
We want to be a customer of the SF Compute Group too!
Perhaps it's also a way for freshly applying grad students to identify a university looking to do research in LLMs that requires scale...
554 5.7.1 <alex@sfcompute.org>: Relay access denied
Price and market depth are very different things
When was the last time you spoke to a chatbot?
Where will the cluster be hosted?
May I suggest that you get your IP transit from he.net?
Who is funding this?
Cause if it’s VC then it’s going to have the same fate as everything else after 5-7 years.
I hope y’all have as innovative of a business model. You’ll need it if you want to do what you’re doing now for more than a few years
Not everything has to grow to have the appetite of Galactus and swallow a whole planet. Making single-digit millions of dollars over a couple of years is still worthwhile, especially if it helps others and moves humanity forwards.
This project isn't ever going to want to try and compete with AWS, so no, it's not a billion-dollar question. $20 million, yeah.
That’s why I’m asking because a “bootstrapped” company like you describe has a future…
One backed by VC doesn’t
I mean they may have a future but not like you describe
Is it accurate to say you're willing to go into ~20,000,000 USD debt to sell discounted compute-as-a-service to researchers/startups, but unwilling to go into debt to sponsor the undergraduate degrees of ~100-500 students at top-tier schools? (40k - 200k USD per degree)
Or, you know, build and fund a small public school/library or two for ~5 years?