It was created at DreamHost (DH) by the founders for their internal needs.
DH was effectively doing IaaS & PaaS before those were industry-coined terms (VPS, managed OS/database/app-servers).
They spun Ceph off and Red Hat bought it.
I think it was the university project of one of the founders, and the others jumped in to support it. Docker has a similar origin story as far as I know.
I ended up on the Ceph IRC channel and eventually had Sage helping me fix the issues directly, finding bugs and writing patches to fix them in real time.
Amazingly nice guy to be so willing to help; he never once chastised me for being so stupid (even though I was). Also wicked smart.
I've talked to him at a few OpenStack and Ceph conferences, and he's always very patient answering questions.
Our EOS clusters have a lot more nodes, however, and use mostly HDDs. CERN also uses ceph extensively.
I bet some engineering effort could divide the whole thing by 10. Build a tiny SBC with 4 PCIe lanes for NVMe, 2x10GbE (as two SFP+ sockets), and a just-fast-enough ARM or RISC-V CPU. Perhaps an eMMC chip or SD slot for boot.
This could scale down to just a few nodes, and it reduces the exposure to a single failure taking out 10 disks at a time.
I bet a lot of copies of this system could fit in a 4U enclosure. Optionally the same enclosure could contain two entirely independent switches to aggregate the internal nodes.
Was just a learning experience at the time.
[0] https://www.hardkernel.com/shop/odroid-hc2-home-cloud-two/
The functionality was very impressive: mixing various pool types on the same set of SSDs, with different redundancy types (erasure-coded, replicated). Now I can't help but look down on a RAID NAS in comparison. Still, some extra packages like the NFS exporter were not ready for the ARM architecture.
PCIe lanes are the bottleneck so far - even my $90 2TB SSDs are rated at 7GB/s on PCIe 4.0 x4. So I don't think SBCs are the optimal solution yet. It looks like Ampere's Altra line can do 128 lanes of PCIe 4.0 at 40W, so a 1U blade with 100G networking could be interesting. I've seen lots of bugs and missing optimisations with ARM though, even in a homelab, so this kind of solution might not be ready for datacenters yet.
this cluster does something vaguely like 0.8 gigabits per second per watt (1 terabyte/s * 8 bits per byte * 1024 Gb per TB / 34 nodes / 300 watts)
a new mac mini (super efficient arm system) runs around 10 watts in interactive usage and can do 10 gigabits per second network, so maybe 1 gigabit per second per watt of data
so OP's cluster, back of the envelope, is basically the same bits per second per watt that a very efficient arm system can do
I don't think running tiny nodes would actually get you any more efficiency, and would probably cost more! performance per watt is quite good on powerful servers now
anyway, this is all open source software running on off-the-shelf hardware, you can do it yourself for a few hundred bucks
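Written out, the back-of-the-envelope math above looks like this (the 34-node count, 300 W per node, and Mac mini figures are the commenter's assumptions, not measured values):

```python
# Back-of-the-envelope efficiency comparison (assumed figures from the comment).
cluster_throughput_tb_s = 1            # ~1 terabyte/s benchmarked
nodes = 34                             # assumed node count
watts_per_node = 300                   # assumed power draw per node

cluster_gbit_s = cluster_throughput_tb_s * 8 * 1024          # gigabits/s
cluster_gbit_per_watt = cluster_gbit_s / (nodes * watts_per_node)

# Mac mini comparison: ~10 Gbit/s NIC at ~10 W interactive draw (assumed).
mac_mini_gbit_per_watt = 10 / 10

print(round(cluster_gbit_per_watt, 2))   # ~0.8 Gbit/s per watt
print(mac_mini_gbit_per_watt)            # 1.0 Gbit/s per watt
```

Either way, the two numbers land within about 20% of each other, which is the point being made.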
You're comparing raw disks with shards and erasure coding.
Lastly, you're comparing only network bandwidth and not storage capacity.
I’m imagining something quite specialized. Use a low frequency CPU with either vector units or even DMA engines optimized for the specific workloads needed, or go all out and arrange for data to be DMAed directly between the disk and the NIC.
But yeah could run on commodity hardware. Not sure those highly efficient arm packaged for a premium from Apple would beat the Dell racks though regarding throughput relative to hardware investment costs.
The Hardkernel HC2 SoC was a nearly ideal form factor for this, and I still have a stack of them lying around that I bought to make a Ceph cluster, but I ran out of steam when I figured out they were 32-bit. Not to say it would be impossible; I just never did it.
If you want to try it with a more modern (and 64-bit) device, the Hardkernel HC4 might do it for you. It's conceptually similar to the HC2 but has two drives. Unfortunately it only has double the RAM (4GB), which is probably not enough anymore.
Looked at as a whole, it's more about whether you're combining resources at a low level (on the PCIe bus within nodes) or a high level (in the switching infrastructure). We should be careful not to push power (or complexity, which often amounts to the same thing) into a separate part of the system that is out of our immediate thoughts but still very much part of the system. Then again, sometimes other parts of the system are much better at handling the complexity for certain cases, and in those cases that can be a definite win.
sigh I used to do some small-scale Ceph back in 2017 or so...
And here we are moving that amount of data every second on the servers of a fairly random entity. We're not talking about a nation state or a supranational research effort.
> a single 2400-foot tape could store the equivalent of some 50,000 punched cards (about 4,000,000 six-bit bytes).
In 1964, with the introduction of System/360, you go an order of magnitude higher: https://www.core77.com/posts/108573/A-Storage-Cabinet-Based-...
> It could store a maximum of 45MB on 2,400 feet
At this point you would only need a few tens of thousands of reels in existence to reach a terabyte. So I strongly suspect the "terabyte point" was some time in the 1960s.
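The arithmetic checks out, using the 45 MB-per-reel figure quoted above:

```python
# How many 2,400-foot reels (45 MB each) add up to a terabyte?
reel_capacity_mb = 45
terabyte_mb = 1_000_000          # 1 TB in MB (decimal)

reels_needed = terabyte_mb / reel_capacity_mb
print(round(reels_needed))       # ~22,222 reels
```

IBM shipped far more than that many reels in the 1960s, which supports the guess.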
Here’s another usage scenario with data usage numbers I found a while back.
> A 2004 paper published in ACM Transactions on Programming Languages and Systems shows how Hancock code can sift calling card records, long distance calls, IP addresses and internet traffic dumps, and even track the physical movements of mobile phone customers as their signal moves from cell site to cell site.
> With Hancock, "analysts could store sufficiently precise information to enable new applications previously thought to be infeasible," the program authors wrote. AT&T uses Hancock code to sift 9 GB of telephone traffic data a night, according to the paper.
https://web.archive.org/web/20200309221602/https://www.wired...
The best part is that it pretty much just works. Very little babysitting with the exception of the occasional fs trim or something.
It’s been a massive improvement for our caching system.
Wasn't immediately clear to me from the blog.
This has happened in multiple clusters, using rook/ceph as well as Longhorn.
When it works, it works great - when it goes wrong it's a huge headache.
Edit: if distributed storage is simply something you're interested in, there are much better options for a homelab setup:
- seaweedfs has been rock solid for me for years in both small and huge scales. we actually moved our production ceph setup to this.
- longhorn was solid for me when i was in the k8s world
- glusterfs is still fine as long as you know what you're getting into.
My requirements for a storage solution are:
> Single root file system
> Storage device failure tolerance
> Gradual expansion capability
The problem with every storage solution I've ever seen is the lack of gradual expandability. I'm not a corporation, I'm just a guy. I don't have the money to buy 200 hard disks all at once. I need to gradually expand capacity as needed.
I was attracted to Ceph because it apparently allows you to throw a bunch of drives of any make and model at it, and it just pools them all up without complaining. The complexity is nightmarish though.
ZFS is nearly perfect, but when it comes to expanding capacity it's just as bad as RAID. Expansion features have seemed just about to land for quite a few years now. I remember getting excited after seeing news here, only for people to deflate my expectations. Btrfs has a flexible block allocator, which is just what I need, but... it's btrfs.
Does that include storage volumes for databases? I was using GlusterFS as a way to scale my Swarm cluster horizontally, and I'm reasonably sure it corrupted one database to the point that I lost more than a few hours of data. I was quite satisfied with the setup until I hit that.
I know I'm considered crazy for sticking with Docker Swarm until now, but aside from this lingering issue of how to manage stateful services, I honestly don't feel the need to move to k8s yet. My cluster is ~10 nodes running < 30 stacks, and it's not like I have tens of people working with me on it.
[1] https://min.io/
https://access.redhat.com/support/policy/updates/rhs
Note that the Red Hat Gluster Storage product has a defined support lifecycle through to 31-Dec-24, after which the Red Hat Gluster Storage product will have reached its EOL. Specifically, RHGS 3.5 represents the final supported RHGS series of releases.
For folks using GlusterFS currently, what's your plan after this year?
I recently tried ceph in a homelab setup, gave up because of complexity, and settled on glusterfs. I'm not a pro though, so I'm not sure if there's any shortcomings that are clear to everybody but me, hence why your comment caught my attention.
First, bear in mind that Ceph is a distributed storage system - so the idea is that you will have multiple nodes.
For learning, you can definitely virtualise it all on a single box - but you'll have a better time with discrete physical machines.
Also, Ceph does prefer physical access to disks (similar to ZFS).
And you do need decent network connectivity - I think that's the main thing people have in mind when they think of high hardware requirements for Ceph. Ideally 10GbE at a minimum - more if you want higher performance - since there can be a lot of network traffic, particularly with things like backfill. (25GbE if you can find that gear cheap for a homelab - 50GbE is a technological dead end. 100GbE works well.)
But honestly, for a homelab, a cheap mini PC or NUC with 10GbE will work fine, you should get acceptable performance, and it'll be good for learning.
You can install Ceph directly on bare-metal, or if you want to do the homelab k8s route, you can use Rook (https://rook.io/).
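For the bare-metal route, the cephadm flow is roughly the following. This is a sketch of a minimal cluster bring-up; the IP, hostname, and disk layout are placeholders, so check the cephadm docs for your release before running anything:

```shell
# Bootstrap the first node as monitor + manager (placeholder IP).
cephadm bootstrap --mon-ip 192.168.1.10

# Add additional hosts to the cluster (placeholder hostname/IP).
ceph orch host add node2 192.168.1.11

# Let the orchestrator turn every unused disk into an OSD.
ceph orch apply osd --all-available-devices
```

After that, `ceph -s` should show the cluster reaching HEALTH_OK once the OSDs come up.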
Hope this helps, and good luck! Let me know if you have any other questions.
[1] https://ceph.io/en/news/blog/2022/install-ceph-in-a-raspberr...
I'm about halfway through the process of moving my 15 virtual machines over. It is a little slow but tolerable. Not having to decide on RAIDs or a NAS ahead of time is amazing. I can throw disks and nodes at it whenever.
Ceph isn’t the fastest, but it’s incredibly resilient and scalable. I haven’t needed any crazy hardware, just RAM and an i7.
I don’t have an incredibly great setup, either: 3x Dell R620s (Ivy Bridge-era Xeons), and 1GbE. Proxmox’s corosync has a dedicated switch, but that’s about it. The disks are nice, to be fair - Samsung PM863 3.84 TB NVMe. They are absolutely bottlenecked by the LAN at the moment.
I plan on upgrading to 10GbE as soon as I can convince myself to pay for an L3 10G switch.
I started considering alternatives when my NAS crossed 100 TB of HDDs, and when a scary scrub prompted me to replace all the HDDs, I finally pulled the trigger. (ZFS resilvered everything fine, but replacing every disk sequentially gave me a lot of time to think.) Today I have far more HDD capacity and a few hundred terabytes of NVMe, and despite its challenges, I wouldn't dare run anything like it without Ceph.
My household is already 100% on Linux, so having a native network filesystem that I can just mount from any laptop is very handy.
Works great over Tailscale too, so I don't even have to be at home.
[1] I run a large install of Ceph at work, so "easy" might be a bit relative.
45Drives has a homelab setup if you're looking for a canned solution.
https://docs.ceph.com/en/latest/cephadm/
To learn about Ceph, I recommend you create at least 3 KVM virtual machines (using virt-manager) on a development box, network them together, and use cephadm to set up a cluster between the VMs. The RAM and storage requirements aren't huge (Ceph can run on Raspberry Pis, after all) and I find it a lot easier to figure things out when I have a desktop window for every node.
I recently set up Ceph twice. Now that Ceph (specifically RBD) is providing the storage for virtual machines, I can live-migrate VMs between hosts and reboot hosts (with zero guest downtime) anytime I need. I'm impressed with how well it works.
I had 4 NUCs running Proxmox+Ceph for a few years, and apart from slightly annoying slowness syncing after spinning the machines up from cold start, it all ran very smoothly.
In most RAIDs (including ZFS's, to my knowledge), the set of disks that can fail together is static.
Say you have physical disks A B C D E F; common setup is to group RAID1'd disks into a pool such as `mirror(A, B) + mirror(C, D) + mirror(E, F)`.
With that, if disk A fails, and then later B fails before you replace A, your data is lost.
But with Ceph, and replication `size = 2`, when A fails, Ceph will (almost) immediately redistribute your data so that it has 2 replicas again, across all remaining disks B-F. So then B can fail and you still have your data.
So in Ceph, you give it a pool of disks and tell it to "figure out the replication" itself. Most other systems don't offer that; the human defines a static replication structure.
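The self-healing behaviour described above can be shown with a toy simulation. This is a hypothetical sketch, nothing like Ceph's actual CRUSH placement algorithm, but it captures the idea that after a failure the survivors re-replicate among themselves:

```python
import random

def place(objects, disks, size=2):
    """Assign each object to `size` distinct disks (toy stand-in for CRUSH)."""
    return {obj: random.sample(disks, size) for obj in objects}

def heal(placement, disks, size=2):
    """After a disk failure, re-replicate any object that lost a copy."""
    for obj, copies in placement.items():
        copies[:] = [d for d in copies if d in disks]   # drop dead replicas
        while len(copies) < size:
            copies.append(random.choice([d for d in disks if d not in copies]))
    return placement

disks = list("ABCDEF")
placement = place(range(100), disks)

disks.remove("A")      # disk A fails
heal(placement, disks) # Ceph-style self-healing kicks in

# Every object has 2 live copies again, none on the failed disk,
# so a later failure of B alone can no longer lose data.
assert all(len(set(c)) == 2 and "A" not in c for c in placement.values())
```

In the static `mirror(A, B)` layout, by contrast, the objects that lived on A would sit with a single copy on B until a human replaced the disk.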
that said, I played with virtualization and I didn't need to.
but then I retired a machine or two and it has been very helpful.
And I used to just use physical disks and partitions. But with the VMs I started using volume manager. It became easier to grow and shrink storage.
and...
well, now a lot of this is second nature. I can spin up a new "machine" for a project and it doesn't affect anything else. I have better backups. I can move a virtual machine.
yeah, there are extra layers of abstraction but hey.
Proxmox will use it - just click to install.
The cluster has 68 nodes, each a Dell PowerEdge R6615 (https://www.delltechnologies.com/asset/en-us/products/server...). The R6615 configuration they run is the one with 10 U.2 drive bays. The U.2 link carries data over 4 PCIe gen4 lanes. Each PCIe lane is capable of 16 Gbit/s. The lanes have negligible ~1.5% overhead thanks to 128b/130b encoding.
This means each U.2 link has a maximum link bandwidth of 16 * 4 = 64 Gbit/s or 8 Gbyte/s. However, the U.2 NVMe drives they use are Dell 15.36TB Enterprise NVMe Read Intensive AG, which appear to be capable of 7 Gbyte/s read throughput (https://www.serversupply.com/SSD%20W-TRAY/NVMe/15.36TB/DELL/...). So they are not bottlenecked by the U.2 link (8 Gbyte/s).
Each node has 10 U.2 drives, so each node can do local read I/O at a maximum of 10 * 7 = 70 Gbyte/s.
However, each node has a network bandwidth of only 200 Gbit/s (2 x 100GbE Mellanox ConnectX-6), which is only 25 Gbyte/s. This implies that remote reads are under-utilizing the drives (capable of 70 Gbyte/s). The network is the bottleneck.
Assuming no additional network bottlenecks (they don't describe the network architecture), this implies the 68 nodes can provide 68 * 25 = 1700 Gbyte/s of network reads. The author benchmarked 1 TiB/s - 1025 GiB/s = 1101 Gbyte/s, to be exact - which is 65% of the theoretical maximum of 1700 Gbyte/s. That's pretty decent, but in theory it could still do a bit better, assuming all nodes can truly saturate their 200 Gbit/s network links concurrently.
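The chain of limits above, written out (all figures taken from the post and the linked spec sheets):

```python
# Per-drive and per-node bandwidth limits (figures from the post).
pcie_lane_gbit = 16                                 # PCIe gen4, per lane
u2_lanes = 4
u2_link_gbyte = pcie_lane_gbit * u2_lanes / 8       # 8 GB/s per U.2 link
drive_read_gbyte = 7                                # drive spec sheet: 7 GB/s
drives_per_node = 10

node_local_read_gbyte = drives_per_node * drive_read_gbyte   # 70 GB/s local
node_network_gbyte = 200 / 8                                 # 2x100GbE = 25 GB/s

# The network, not the drives, caps remote reads.
nodes = 68
cluster_max_gbyte = nodes * node_network_gbyte               # 1700 GB/s
benchmarked_gbyte = 1101                                     # 1025 GiB/s

print(u2_link_gbyte)                                    # 8.0
print(cluster_max_gbyte)                                # 1700.0
print(round(benchmarked_gbyte / cluster_max_gbyte, 2))  # 0.65
```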
Reading this whole blog post, I got the impression Ceph's complexity hits the CPU pretty hard. Not compiling a module with -O2 ("Fix Three", linked by the author: https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1894453) can make things "up to 5x slower with some workloads" (https://bugs.gentoo.org/733316), which is pretty unexpected for a pure I/O workload. Also, what's up with the OSD's threads wasting so much CPU grabbing the IOMMU spinlock? I agree with the conclusion that the OSD threading model is suboptimal. A relatively simple synthetic 100% read benchmark should not expose thread contention if that part of Ceph's software architecture were well designed (it's fixable, so I hope the Ceph devs prioritize it).
I did some work last summer kind of duct taping the OSD's existing threading model (double buffering the hand-off between async msgr and worker threads, adaptive thread wakeup, etc). I could achieve significant performance / efficiency gains under load, but at the expense of increased low-load latency (Ceph by default is very aggressive about waking up threads when new IO arrives for a given shard).
One of the other core developers and I discussed it and we both came to the conclusion that it probably makes sense to do a more thorough rewrite of the threading code.
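The "double buffering the hand-off" idea mentioned above can be sketched in miniature. This is purely illustrative code, nothing like the actual OSD internals: a producer (think: async messenger) appends to a front buffer while a worker swaps and drains batches, so the lock is held only for the swap rather than per item:

```python
import threading

class DoubleBuffer:
    """Toy double-buffered producer/worker hand-off."""

    def __init__(self):
        self.front = []
        self.ready = threading.Condition()

    def submit(self, item):
        with self.ready:
            self.front.append(item)
            self.ready.notify()   # eager wakeup, like Ceph's default behaviour

    def drain(self):
        with self.ready:
            while not self.front:
                self.ready.wait()
            batch, self.front = self.front, []   # swap buffers under the lock
        return batch                             # process the batch lock-free

buf = DoubleBuffer()
for i in range(5):
    buf.submit(i)
assert buf.drain() == [0, 1, 2, 3, 4]
```

The latency trade-off mentioned above shows up in `notify()`: batching wakeups saves CPU under load but delays the first item of a batch when the system is idle.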
At least that's the number I could find. Not exactly tons of reviews of these enterprise NVMe disks...
Still, that seems like a good match to the NICs. At this scale most workloads will likely appear as random IO at the storage layer anyway.
You need to pay attention to the kind of hardware you use, but you can definitely get Ceph down to 0.5-0.6 ms latency on block workloads doing single thread, single queue, sync 4K writes.
Source: I run Ceph at work doing pretty much this.
50th percentile = 1.75 ms
90th percentile = 3.15 ms
99th percentile = 9.54 ms
That's with 700 MB/s of reads and 200 MB/s of writes, or approximately 7000 read IOPS and 9000 write IOPS. Everything has a trade-off, and with Ceph you get a ton of capability, but latency is that trade-off. Databases - depending on requirements - may be better off on regular NVMe rather than on Ceph.
It featured briefly in a Jeff Geerling video recently :-)
[0]: Understanding Ceph: open-source scalable storage https://louwrentius.com/understanding-ceph-open-source-scala...
I would love to see some benchmarks there.
Functionally, Linux implements a file system (well, several!) as well (in addition to many other OS features) -- but (usually!) only on top of local hardware.
There seems to be some missing software here -- if we examine these two paradigms side-by-side.
For example, what if I want a Linux (or more broadly, a general OS) -- but one that doesn't manage a local file system or local storage at all?
One that operates solely using the network, solely using a distributed file system that Ceph, or software like Ceph, would provide?
Conversely, what if I don't want to run a full OS on a network machine, a network node that manages its own local storage?
The only thing I can think of to solve those types of problems -- is:
What if the Linux filesystem layer were written as a completely separate piece of software, the way a distributed file system like Ceph is, not dependent on the rest of the kernel source code (although still compilable into the kernel, as most Linux components normally are)...
A lot of work? Probably!
But there seems to be some software need for something between a solely distributed file system as Ceph is, and a completely monolithic "everything baked in" (but not distributed!) OS/kernel as Linux is...
Note that I am just thinking aloud here -- I probably am wrong and/or misinformed on one or more fronts!
So, kindly take this random "thinking aloud" post -- with the proverbial "grain of salt!" :-)
Linux can boot from NFS although that's kind of lost knowledge. Booting from CephFS might even be possible if you put the right parts in the initrd.
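For reference, the NFS-root case is just kernel command-line parameters plus NFS client support built into the kernel (or the initrd). A sketch with placeholder addresses, per the kernel's nfsroot documentation:

```
# Kernel command line for a diskless NFS-root boot (placeholder addresses).
# Requires CONFIG_NFS_FS and CONFIG_ROOT_NFS, or an initrd that mounts NFS.
root=/dev/nfs nfsroot=192.168.1.10:/srv/nfsroot ip=dhcp rw
```

Booting from CephFS would indeed need the initrd route, since there is no built-in `root=` handler for it.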
The downside is durability and operations - we have to keep Ceph alive and are responsible for making sure the data is persistent. That said, we're storing cache from container builds, so in the worst-case where we lose the storage cluster, we can run builds without cache while we restore.
There was no significant difference when testing between the latest HWE on Ubuntu 20.04 and kernel 6.2 on Ubuntu 22.04. In both cases we ran into the same IOMMU behaviour. Our tooling is all very much catered around Ubuntu so testing newer kernels with other distros just wasn’t feasible in the timescale we had to get this built. The plan was < 2 months from initial design to completion.
Awesome to see this on HN, we’re a pretty under-the-radar operation so there’s not much more I can say but proud to have worked on this!
Note that 36-port 56G switches are dirt cheap on eBay, and 4 Tbps is good enough for most homelab use cases.
but will it be able to handle combined TB/s traffic?
https://www.qct.io/product/index/Switch/Ethernet-Switch/T700...
And then you connect the TOR switches to higher level switches in something like a Clos distribution to get the desired bandwidth between any two nodes:
https://www.techtarget.com/searchnetworking/definition/Clos-...
the switch is fine, I'm buying 64x800G switches, but NIC wise I'm limited to 400Gbit.
As per the blog, the cluster is now in a 6+2 EC configuration for production which gives ~7PiB usable. Expensive yes, but well worth it if this is the scale and performance required.
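The ~7 PiB figure checks out against the hardware numbers from the post (68 nodes, 10 drives of 15.36 TB each, 6+2 erasure coding):

```python
# Usable capacity under 6+2 erasure coding (figures from the post).
nodes = 68
drives_per_node = 10
drive_tb = 15.36                              # decimal terabytes per drive

raw_tb = nodes * drives_per_node * drive_tb   # 10444.8 TB raw
data_fraction = 6 / (6 + 2)                   # 6 data chunks + 2 parity chunks
usable_tb = raw_tb * data_fraction            # 7833.6 TB

usable_pib = usable_tb * 1e12 / 2**50         # convert decimal TB to binary PiB
print(round(usable_pib, 1))                   # ~7.0 PiB
```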
To put it into perspective, there are 68 nodes with 98 hardware threads each, which means roughly 1000/7000 ≈ 140 MB/s per thread, or 280 MB/s per core - and that's not that impressive, to be honest.