A big part of the reason I built this was data privacy: I don't want to hand over my private data to any company to further train their closed-weight models. And given the recent drop in output quality on different platforms (ChatGPT, Claude, etc.), I don't regret spending the money on this setup.
I've also been able to do a lot of cool things with this server: leveraging tensor parallelism and batch inference, generating synthetic data, and experimenting with fine-tuning models on my private data. I'm currently building a model from scratch, mainly as a learning project, but I'm also finding some cool things along the way, and if I can get around to ironing out the kinks, I might release it and write a tutorial from my notes.
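For anyone curious what tensor-parallel batch inference looks like in practice, here is a minimal sketch using vLLM; the model name and sampling settings are illustrative assumptions, not necessarily what I run:

    # Minimal sketch: tensor-parallel batch inference with vLLM.
    # Model name and sampling parameters are illustrative assumptions.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="meta-llama/Meta-Llama-3.1-70B-Instruct",
        tensor_parallel_size=8,  # shard the weights across all 8 GPUs
    )

    params = SamplingParams(temperature=0.7, max_tokens=256)
    prompts = [
        "Summarize the trade-offs of tensor parallelism.",
        "Generate a synthetic Q&A pair about PCIe risers.",
    ]

    # Batched generation: vLLM schedules all prompts together for throughput.
    for output in llm.generate(prompts, params):
        print(output.outputs[0].text)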
So I finally had the time this weekend to get my blog up and running, and I'm planning to follow this post with a series on my learnings and findings. I'm also open to topics and ideas to experiment with on this server and write about, so feel free to shoot your shot: if you have ideas you want to test and don't have the hardware, I'm more than willing to run them on your behalf and share the findings.
Please let me know if you have any questions; my PMs are open, and you can also reach me on any of the socials I have posted on my website.
I wrote a blog post on reducing the power limits of Nvidia GPUs. Definitely try it out. https://shelbyjenkins.github.io/blog/power-limit-nvidia-linu...
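The post covers the details, but the gist can be sketched with NVML. This is a rough sketch, not code from the linked post; the 250 W cap is an arbitrary example, and setting limits needs root:

    # Rough sketch: cap a GPU's power limit via NVML (pip install nvidia-ml-py).
    # The 250 W value is an arbitrary example; equivalent to `nvidia-smi -pl 250`.
    import pynvml

    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)

    default_mw = pynvml.nvmlDeviceGetPowerManagementDefaultLimit(handle)
    print(f"Default limit: {default_mw / 1000:.0f} W")

    # NVML takes milliwatts; requires root privileges.
    pynvml.nvmlDeviceSetPowerManagementLimit(handle, 250_000)

    pynvml.nvmlShutdown()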
It is not expensive, nor is it highly technical. It's not like we're factoring in latency and crosstalk...
Read a quick how-to, cruise into Home Depot, and grab some Legos off the shelf. Far easier to figure out than executing "hello world" without domain expertise.
FYI, I can handle electrical system design and sheet metal enclosure design/fabrication for these rigs, but my software knowledge is limited when it comes to ML. If anyone's interested, I'd love to collaborate on a joint venture to produce these rigs commercially.
I'm curious: how do you use, e.g., a washing machine or an electric kettle if 2 kW is enough to trip your breaker? You should simply know your wiring limits. The breaker/wiring at my home wouldn't even notice this.
In the real world you would plug them into a PDU such as: https://www.apc.com/us/en/product/AP9571A/rack-pdu-basic-1u-...
Each GPU will take around 700 W, and then you have the rest of the system to power, so depending on CPU/RAM/storage...
And then you need to cool it!
Hell, most kettles use 3 kW. Though for a big server I'd get a dedicated circuit wired, the same way power showers are done (7-12 kW or so).
Which is all to say it's possible in a residential setting, just probably expensive.
16 amps x 120 V = 1920 W: it would probably trip after several minutes.
16 amps x 230 V = 3680 W: it wouldn't trip.
So, as mentioned in the article, I actually installed two 30 A / 240 V breakers dedicated entirely to this setup (and the next one, in case I decide to expand to 16 GPUs over 2 nodes lol). Each breaker should comfortably power up to ~6000 W. I also installed a specific kind of power outlet that can handle that current, and I have done some extreme research into PDUs. I plan on covering all of that in this series (part 3, per my current tentative plans), so stay tuned: bookmark the website, add the RSS feed to your digest, or follow me on any of the socials if this is something you want to nail down without spending a month on research like I did :'D
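For reference, the capacity math behind that ~6000 W figure works out roughly like this, assuming the common 80% continuous-load derating used in US electrical code:

    # Capacity of one 30 A / 240 V circuit, derated for continuous loads.
    amps, volts = 30, 240
    peak_w = amps * volts              # 7200 W absolute maximum
    continuous_w = int(peak_w * 0.8)   # 5760 W under the 80% rule
    print(peak_w, continuous_w)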
What is your cost of electricity per kilowatt hour and what is the cost of this setup per month?
Maybe a bit of a stupid question, but what do you actually do with the models you run/build, apart from tinkering? I'd assume most tinkering can also be done on smaller systems. Is it in order to build a model that is actually 'useful'/competitive?
But the problem is that even 7B models are too slow on my PC.
Hosted models are lightning fast. I considered the possibility of buying hardware but decided against it.
I wonder if this will happen. It's already really hard to buy big HDDs for my NAS because nobody buys external drives anymore. So the pricing has gone up a lot for the prosumer.
I expect something similar to happen with AI. The big cloud players are all leaders on LLMs, and their goal is to keep us beholden to their cloud services. Cheap home hardware with serious capability is not something they're interested in; they want to keep it out of our reach so we pay them rent and they can mine our data.
That said, I really don't think the way forward for hobbyists is maxing out VRAM. Small models are becoming much more capable, accelerators are a possibility, and there may be no need to run a 70-billion-parameter model in memory at all when there are MoEs like Mixtral and small capable models like Phi.
I buy refurb/used enterprise drives for that reason, generally around $12 per TB for the recent larger drives and around $6 per TB for smaller drives. You just need a SAS interface, but that's not difficult or expensive.
I.e., 25 TB for $320, or 12 TB for $80.
IME 20 TB drives are easy to find.
I don't think the clouds have access to bigger drives or anything.
Similarly, we can buy 8x A100s, they're just fundamentally expensive whether you're a business or not.
There doesn't seem to be any "wall" up like there used to be with proprietary hardware.
For me these prices are prohibitive. Just like the A100s are (though those are even more so of course).
The problem is the common consumer relying on the cloud, so these kinds of products become niche and lose volume. Also, the cloud providers don't pay what we do for a GPU or HDD; they buy them by the tens of thousands and get deep discounts. That's why the RRPs, which we do pay, are highly inflated.
Of course the vendor can't make a profit with such discounts, so they inflate the RRP, and we're the ones who end up paying it.
Do you have a rough estimate of how much this cost? I'm curious, since I just built my own 2x 3090 rig and wondered about going EPYC for the potential to have more cards (I stuck with AM5 for cheapness, though).
All in all I spent about $3500 for everything. I'm guessing this is closer to $12-15k? The CPU alone is around $800 on eBay.
It also costs a lot to power. In the summer, twice as much as you'd expect, because unless the rig is outside, you need to remove 1000+ watts of extra heat with your AC. Put that together and RunPod starts to look very tempting!
I have a setup with 3 RTX 3090 GPUs and the PCIe risers are a huge source of pain and system crashes.
I've had my eye on these for a bit https://c-payne.com/
The worst thing is dust. The cards would accumulate so much that every week I had to blow the dust off with an air compressor.
Electricity cost was around $4 a day (24 h x ~$0.20/kWh). If renting a GPU online is more expensive, maybe the initial cost could be justified.
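A quick, assumption-heavy way to frame that comparison; the cloud rate below is a made-up placeholder, not a quote from any provider:

    # Back-of-the-envelope: local power cost vs. renting a GPU online.
    # All figures are illustrative assumptions.
    rate_per_kwh = 0.20     # $/kWh, from the comment above
    avg_draw_kw = 0.85      # assumed average draw of the rig
    local_per_day = 24 * rate_per_kwh * avg_draw_kw

    cloud_per_hour = 0.30   # hypothetical rental rate for a comparable GPU
    cloud_per_day = 24 * cloud_per_hour

    print(f"local ~${local_per_day:.2f}/day vs cloud ~${cloud_per_day:.2f}/day")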
Except without the sketchy x1 PCIe lanes. That's the part that makes nice LLM setups hard.
This might be the right time to ask: so, on the one hand, this is what it takes to pack 192 GB of Nvidia-flavored VRAM into a home server.
I'm curious whether there's any hope of doing interesting work on a MacBook Pro, which can currently be max-specced to 128 GB of unified memory (for the low, low price of $4.7k).
I know there's no hope of running CUDA on the MacBook, and I'm clearly out of my depth here. But the possibly naive daydream of tossing a massive LLM into a backpack is alluring...
My assumption was that going beyond 2 cards incurs a significant bandwidth penalty, since you go from NVLink between 2x 3090s to PCIe for communication with the other 3090s.
What kind of T/s speeds are you getting with this type of 8x3090 setup?
Presumably an even crazier 16x 4090 setup would then be an option for someone with enough PCIe slots/risers/extenders.
I hope this guy posts updates.
Are you intending to use the capacity all for yourself or rent it out to others?
As a side note, I'd love to find a chart/dataset on the cost-performance ratio of open-source models, and possibly a $/Elo value (where $ is the cost to build and operate the machine, and Elo is a proxy for the average performance of the model).
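For illustration, the proposed metric could be sketched like this; every number below is made up, and "Elo" would be something like an LMSYS Chatbot Arena rating:

    # Sketch of the proposed $/Elo metric; all figures are hypothetical.
    def dollars_per_elo(build_cost: float, monthly_power: float,
                        months: int, elo: float) -> float:
        total_cost = build_cost + monthly_power * months
        return total_cost / elo

    # Hypothetical: $6000 rig, $120/month power, amortized over 24 months,
    # serving a model rated ~1250 Elo.
    print(f"${dollars_per_elo(6000, 120, 24, 1250):.2f} per Elo point")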
I haven't had enough time to find a way to split inference, which is what I'm most interested in. Yours is also much better with the 1600 W supply; I have a hodgepodge.
I'm a believer! Can't wait to hear more about this.
I'm excited to see your benchmarks :)
Is a blockchain needed to sell unused GPU capacity?
Eventually there could be some tipping point where networks are fast enough and there are enough hosting participants that it could become a worldwide/free computing platform, not just for AI but for anything.
IRL all you need is a simple platform to pay for and schedule jobs on others' GPUs.
- Fitting models in memory
- Inference / Training speed
8 x RTX 3090s will absolutely CRUSH a single Mac Studio in raw performance.
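Memory is usually the binding constraint, and a rough rule of thumb for the weights alone (ignoring KV cache and activation overhead) makes the comparison concrete:

    # Rule of thumb: VRAM needed just for the model weights.
    def weight_vram_gb(params_billions: float, bits_per_weight: float) -> float:
        return params_billions * 1e9 * bits_per_weight / 8 / 1e9

    print(weight_vram_gb(70, 16))  # ~140 GB in fp16: needs many GPUs
    print(weight_vram_gb(70, 4))   # ~35 GB at 4-bit: fits in 2x 3090, before cache headroom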
Also, modern GPUs are surprisingly good at throttling their power usage when not actively in use, just like CPUs. So while you need 3 kW+ of PSU capacity for an 8x 3090 setup, it won't draw anywhere near 3 kW on average unless you're literally using the LLM 24/7.
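You can verify the idle behavior yourself; a minimal sketch with NVML, assuming nvidia-ml-py is installed:

    # Sample per-GPU power draw to see how far below the limit idle cards sit.
    import time
    import pynvml

    pynvml.nvmlInit()
    count = pynvml.nvmlDeviceGetCount()
    for _ in range(5):
        watts = [pynvml.nvmlDeviceGetPowerUsage(
                     pynvml.nvmlDeviceGetHandleByIndex(i)) / 1000
                 for i in range(count)]  # NVML reports milliwatts
        print("  ".join(f"GPU{i}: {w:.0f} W" for i, w in enumerate(watts)))
        time.sleep(2)
    pynvml.nvmlShutdown()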
Running Llama 3.1 70B is brutal on this thing. Responses take minutes. From what I've read, someone running the same model with 32 GB of GPU memory seems to get far better results.
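If those better results come from partial GPU offload, the setup might look roughly like this with llama-cpp-python; the file name and layer count are placeholder assumptions to tune for your VRAM:

    # Sketch: partial GPU offload of a quantized 70B model via llama-cpp-python.
    # Model path and n_gpu_layers are illustrative; raise n_gpu_layers until VRAM is full.
    from llama_cpp import Llama

    llm = Llama(
        model_path="./llama-3.1-70b-instruct.Q4_K_M.gguf",  # hypothetical local file
        n_gpu_layers=40,   # layers offloaded to the GPU; the rest run on CPU
        n_ctx=4096,
    )

    out = llm("Q: Why does partial offload beat CPU-only inference? A:", max_tokens=128)
    print(out["choices"][0]["text"])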
I'm currently using reflection:70b_q4, which does a very good job in my opinion. It generates the response at 5.5 tokens/s, which is just about my reading speed.
edit: I usually don't run larger quants (q6) because of the speed. I'd guess a 405B model would just be awfully slow.
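For context, 5.5 tokens/s really is about silent-reading pace, assuming the rough ~0.75 words-per-token heuristic for English:

    # Back-of-the-envelope: tokens/s -> words per minute.
    tokens_per_second = 5.5
    words_per_token = 0.75   # rough heuristic for English text

    wpm = tokens_per_second * words_per_token * 60
    print(f"~{wpm:.0f} words per minute")  # ~248 wpm, near typical reading speed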