It was created at DreamHost (DH) by the founders for their internal needs.
DH was effectively doing IaaS & PaaS before those were industry-coined terms (VPS, managed OS/database/app-servers).
They spun Ceph off and Red Hat bought it.
I think it was the university project of one of the founders, and the others jumped in to support it. Docker has a similar origin story as far as I know.
I ended up on the Ceph IRC channel and eventually had Sage helping me fix the issues directly, finding bugs and writing patches to fix them in real time.
Amazingly nice guy to be so willing to help; he never once chastised me for being so stupid (even though I was). Also wicked smart.
I've talked to him at a few OpenStack and Ceph conferences, and he's always very patient answering questions.
Our EOS clusters have a lot more nodes, however, and use mostly HDDs. CERN also uses ceph extensively.
I bet some engineering effort could divide the whole thing by 10. Build a tiny SBC with 4 PCIe lanes for NVMe, 2x10GbE (as two SFP+ sockets), and a just-fast-enough ARM or RISC-V CPU. Perhaps an eMMC chip or SD slot for boot.
This could scale down to just a few nodes, and it reduces the exposure to a single failure taking out 10 disks at a time.
I bet a lot of copies of this system could fit in a 4U enclosure. Optionally the same enclosure could contain two entirely independent switches to aggregate the internal nodes.
Was just a learning experience at the time.
[0] https://www.hardkernel.com/shop/odroid-hc2-home-cloud-two/
The functionality was very impressive: mixing various pool types on the same set of SSDs, with different redundancy types (erasure-coded, replicated). Now I can't help but look down on a RAID NAS in comparison. Still, some extra packages like the NFS exporter were not ready for the ARM architecture.
PCIe lanes are the bottleneck so far - even my $90 2TB SSDs are rated at 7GB/s on PCIe 4.0 x4. So I don't think SBCs are the optimal solution yet. It looks like Ampere's Altra line can do 128 lanes of PCIe 4.0 at 40W, so a 1U blade with 100G networking could be interesting. I've seen lots of bugs and missing optimisations with ARM though, even in a homelab, so this kind of solution might not be ready for datacenters yet.
this cluster does something vaguely like 0.8 gigabits per second per watt (1 terabyte/s * 8 bits per byte * 1024 Gb per TB / 34 nodes / 300 watts)
a new mac mini (super efficient arm system) runs around 10 watts in interactive usage and can do 10 gigabits per second network, so maybe 1 gigabit per second per watt of data
so OP's cluster, back of the envelope, is basically the same bits per second per watt that a very efficient arm system can do
I don't think running tiny nodes would actually get you any more efficiency, and would probably cost more! performance per watt is quite good on powerful servers now
anyway, this is all open source software running on off-the-shelf hardware, you can do it yourself for a few hundred bucks
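Written out, the back-of-the-envelope math above looks like this (the 34-node count, 300 W per node, and Mac mini figures are the commenter's assumptions, not measured values):

```python
# Back-of-the-envelope efficiency comparison (assumed figures from the comment).
cluster_throughput_tb_s = 1            # ~1 terabyte/s benchmarked
nodes = 34                             # assumed node count
watts_per_node = 300                   # assumed power draw per node

cluster_gbit_s = cluster_throughput_tb_s * 8 * 1024          # gigabits/s
cluster_gbit_per_watt = cluster_gbit_s / (nodes * watts_per_node)

# Mac mini comparison: ~10 Gbit/s NIC at ~10 W interactive draw (assumed).
mac_mini_gbit_per_watt = 10 / 10

print(round(cluster_gbit_per_watt, 2))   # ~0.8 Gbit/s per watt
print(mac_mini_gbit_per_watt)            # 1.0 Gbit/s per watt
```

Either way, the two numbers land within about 20% of each other, which is the point being made.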
You're comparing raw disks with shards and erasure coding.
Lastly, you're comparing only network bandwidth and not storage capacity.
I’m imagining something quite specialized. Use a low frequency CPU with either vector units or even DMA engines optimized for the specific workloads needed, or go all out and arrange for data to be DMAed directly between the disk and the NIC.
But yeah could run on commodity hardware. Not sure those highly efficient arm packaged for a premium from Apple would beat the Dell racks though regarding throughput relative to hardware investment costs.
The Hardkernel HC2 SoC was a nearly ideal form factor for this, and I still have a stack of them lying around that I bought to make a Ceph cluster, but I ran out of steam when I figured out they were 32-bit. Not to say it would be impossible; I just never did it.
If you want to try it with a more modern (and 64-bit) device, the Hardkernel HC4 might do it for you. It's conceptually similar to the HC2 but has two drives. Unfortunately it only has double the RAM (4GB), which is probably not enough anymore.
Looked at as a whole, it's more about whether you're combining resources at a low level (on the PCIe bus within nodes) or a high level (in the switching infrastructure). We should be careful not to push power (or complexity, which often amounts to the same thing) into a separate part of the system that is out of our immediate thoughts but still very much part of the system. Then again, sometimes other parts of the system are much better at handling the complexity for certain cases, and in those cases that can be a definite win.
sigh I used to do some small-scale Ceph back in 2017 or so...
And here we are moving that amount of data every second on the servers of a fairly random entity. We're not talking about a nation state or a supranational research effort.
> a single 2400-foot tape could store the equivalent of some 50,000 punched cards (about 4,000,000 six-bit bytes).
In 1964, with the introduction of System/360, you go an order of magnitude higher: https://www.core77.com/posts/108573/A-Storage-Cabinet-Based-...
> It could store a maximum of 45MB on 2,400 feet
At this point you would only need a few tens of thousands of reels in existence to reach a terabyte. So I strongly suspect the "terabyte point" was some time in the 1960s.
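The arithmetic checks out, using the 45 MB-per-reel figure quoted above:

```python
# How many 2,400-foot reels (45 MB each) add up to a terabyte?
reel_capacity_mb = 45
terabyte_mb = 1_000_000          # 1 TB in MB (decimal)

reels_needed = terabyte_mb / reel_capacity_mb
print(round(reels_needed))       # ~22,222 reels
```

IBM shipped far more than that many reels in the 1960s, which supports the guess.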
Here’s another usage scenario with data usage numbers I found a while back.
> A 2004 paper published in ACM Transactions on Programming Languages and Systems shows how Hancock code can sift calling card records, long distance calls, IP addresses and internet traffic dumps, and even track the physical movements of mobile phone customers as their signal moves from cell site to cell site.
> With Hancock, "analysts could store sufficiently precise information to enable new applications previously thought to be infeasible," the program authors wrote. AT&T uses Hancock code to sift 9 GB of telephone traffic data a night, according to the paper.
https://web.archive.org/web/20200309221602/https://www.wired...
The best part is that it pretty much just works. Very little babysitting with the exception of the occasional fs trim or something.
It’s been a massive improvement for our caching system.
Wasn't immediately clear to me from the blog.
This has happened in multiple clusters, using rook/ceph as well as Longhorn.
When it works, it works great - when it goes wrong it's a huge headache.
Edit: if distributed storage is simply something you're interested in, there are much better options for a homelab setup:
- seaweedfs has been rock solid for me for years in both small and huge scales. we actually moved our production ceph setup to this.
- longhorn was solid for me when i was in the k8s world
- glusterfs is still fine as long as you know what you're getting into.
My requirements for a storage solution are:
> Single root file system
> Storage device failure tolerance
> Gradual expansion capability
The problem with every storage solution I've ever seen is the lack of gradual expandability. I'm not a corporation, I'm just a guy. I don't have the money to buy 200 hard disks all at once. I need to gradually expand capacity as needed.
I was attracted to Ceph because it apparently allows you to throw a bunch of drives of any make and model at it, and it just pools them all up without complaining. The complexity is nightmarish though.
ZFS is nearly perfect, but when it comes to expanding capacity it's just as bad as RAID. Expansion features have seemed just about to land for quite a few years now. I remember getting excited after seeing news here, only for people to deflate my expectations. Btrfs has a flexible block allocator, which is just what I need, but... it's btrfs.
Does that include storage volumes for databases? I was using GlusterFS as a way to scale my Swarm cluster horizontally, and I'm reasonably sure it corrupted one database to the point that I lost more than a few hours of data. I was quite satisfied with the setup until I hit that.
I know I'm considered crazy for sticking with Docker Swarm until now, but aside from this lingering issue of how to manage stateful services, I honestly don't feel the need to move to k8s yet. My cluster is ~10 nodes running < 30 stacks, and it's not like I have tens of people working with me on it.
[1] https://min.io/
https://access.redhat.com/support/policy/updates/rhs
Note that the Red Hat Gluster Storage product has a defined support lifecycle through to 31-Dec-24, after which the Red Hat Gluster Storage product will have reached its EOL. Specifically, RHGS 3.5 represents the final supported RHGS series of releases.
For folks using GlusterFS currently, what's your plan after this year?
I recently tried ceph in a homelab setup, gave up because of complexity, and settled on glusterfs. I'm not a pro though, so I'm not sure if there's any shortcomings that are clear to everybody but me, hence why your comment caught my attention.
First, bear in mind that Ceph is a distributed storage system - so the idea is that you will have multiple nodes.
For learning, you can definitely virtualise it all on a single box - but you'll have a better time with discrete physical machines.
Also, Ceph does prefer physical access to disks (similar to ZFS).
And you do need decent network connectivity - I think that's the main thing people have in mind when they think of high hardware requirements for Ceph. Ideally 10GbE at a minimum - more if you want higher performance - since there can be a lot of network traffic, particularly with things like backfill. (25GbE if you can find that gear cheap for a homelab - 50GbE is a technological dead end. 100GbE works well.)
But honestly, for a homelab, a cheap mini PC or NUC with 10GbE will work fine, you should get acceptable performance, and it'll be good for learning.
You can install Ceph directly on bare-metal, or if you want to do the homelab k8s route, you can use Rook (https://rook.io/).
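For the bare-metal route, the cephadm flow is roughly the following. This is a sketch of a minimal cluster bring-up; the IP, hostname, and disk layout are placeholders, so check the cephadm docs for your release before running anything:

```shell
# Bootstrap the first node as monitor + manager (placeholder IP).
cephadm bootstrap --mon-ip 192.168.1.10

# Add additional hosts to the cluster (placeholder hostname/IP).
ceph orch host add node2 192.168.1.11

# Let the orchestrator turn every unused disk into an OSD.
ceph orch apply osd --all-available-devices
```

After that, `ceph -s` should show the cluster reaching HEALTH_OK once the OSDs come up.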
Hope this helps, and good luck! Let me know if you have any other questions.
[1] https://ceph.io/en/news/blog/2022/install-ceph-in-a-raspberr...
I'm about halfway through the process of moving my 15 virtual machines over. It is a little slow but tolerable. Not having to decide on RAIDs or a NAS ahead of time is amazing. I can throw disks and nodes at it whenever.
Ceph isn’t the fastest, but it’s incredibly resilient and scalable. I haven’t needed any crazy hardware, just RAM and an i7.
I don’t have an incredibly great setup, either: 3x Dell R620s (Ivy Bridge-era Xeons), and 1GbE. Proxmox’s corosync has a dedicated switch, but that’s about it. The disks are nice, to be fair - Samsung PM863 3.84 TB NVMe. They are absolutely bottlenecked by the LAN at the moment.
I plan on upgrading to 10GbE as soon as I can convince myself to pay for an L3 10G switch.
I started considering alternatives when my NAS crossed 100 TB of HDDs, and when a scary scrub prompted me to replace all the HDDs, I finally pulled the trigger. (ZFS resilvered everything fine, but replacing every disk sequentially gave me a lot of time to think.) Today I have far more HDD capacity and a few hundred terabytes of NVMe, and despite its challenges, I wouldn't dare run anything like it without Ceph.
My household is already 100% on Linux, so having a native network filesystem that I can just mount from any laptop is very handy.
Works great over Tailscale too, so I don't even have to be at home.
[1] I run a large install of Ceph at work, so "easy" might be a bit relative.
45Drives has a homelab setup if you're looking for a canned solution.
https://docs.ceph.com/en/latest/cephadm/
To learn about Ceph, I recommend you create at least 3 KVM virtual machines (using virt-manager) on a development box, network them together, and use cephadm to set up a cluster between the VMs. The RAM and storage requirements aren't huge (Ceph can run on Raspberry Pis, after all) and I find it a lot easier to figure things out when I have a desktop window for every node.
I recently set up Ceph twice. Now that Ceph (specifically RBD) is providing the storage for virtual machines, I can live-migrate VMs between hosts and reboot hosts (with zero guest downtime) anytime I need. I'm impressed with how well it works.
I had 4 NUCs running Proxmox+Ceph for a few years, and apart from slightly annoying slowness syncing after spinning the machines up from cold start, it all ran very smoothly.
In most RAIDs (including ZFS's, to my knowledge), the set of disks that can fail together is static.
Say you have physical disks A B C D E F; common setup is to group RAID1'd disks into a pool such as `mirror(A, B) + mirror(C, D) + mirror(E, F)`.
With that, if disk A fails, and then later B fails before you replace A, your data is lost.
But with Ceph, and replication `size = 2`, when A fails, Ceph will (almost) immediately redistribute your data so that it has 2 replicas again, across all remaining disks B-F. So then B can fail and you still have your data.
So in Ceph, you give it a pool of disks and tell it to "figure out the replication" itself. Most other systems don't offer that; the human defines a static replication structure.
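The self-healing behaviour described above can be shown with a toy simulation. This is a hypothetical sketch, nothing like Ceph's actual CRUSH placement algorithm, but it captures the idea that after a failure the survivors re-replicate among themselves:

```python
import random

def place(objects, disks, size=2):
    """Assign each object to `size` distinct disks (toy stand-in for CRUSH)."""
    return {obj: random.sample(disks, size) for obj in objects}

def heal(placement, disks, size=2):
    """After a disk failure, re-replicate any object that lost a copy."""
    for obj, copies in placement.items():
        copies[:] = [d for d in copies if d in disks]   # drop dead replicas
        while len(copies) < size:
            copies.append(random.choice([d for d in disks if d not in copies]))
    return placement

disks = list("ABCDEF")
placement = place(range(100), disks)

disks.remove("A")      # disk A fails
heal(placement, disks) # Ceph-style self-healing kicks in

# Every object has 2 live copies again, none on the failed disk,
# so a later failure of B alone can no longer lose data.
assert all(len(set(c)) == 2 and "A" not in c for c in placement.values())
```

In the static `mirror(A, B)` layout, by contrast, the objects that lived on A would sit with a single copy on B until a human replaced the disk.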
that said, I played with virtualization and I didn't need to.
but then I retired a machine or two and it has been very helpful.
And I used to just use physical disks and partitions. But with the VMs I started using volume manager. It became easier to grow and shrink storage.
and...
well, now a lot of this is second nature. I can spin up a new "machine" for a project and it doesn't affect anything else. I have better backups. I can move a virtual machine.
yeah, there are extra layers of abstraction but hey.
Proxmox will use it - just click to install.
The cluster has 68 nodes, each a Dell PowerEdge R6615 (https://www.delltechnologies.com/asset/en-us/products/server...). The R6615 configuration they run is the one with 10 U.2 drive bays. The U.2 link carries data over 4 PCIe gen4 lanes. Each PCIe lane is capable of 16 Gbit/s. The lanes have negligible ~1.5% overhead thanks to 128b/130b encoding.
This means each U.2 link has a maximum link bandwidth of 16 * 4 = 64 Gbit/s or 8 Gbyte/s. However, the U.2 NVMe drives they use are Dell 15.36TB Enterprise NVMe Read Intensive AG, which appear to be capable of 7 Gbyte/s read throughput (https://www.serversupply.com/SSD%20W-TRAY/NVMe/15.36TB/DELL/...). So they are not bottlenecked by the U.2 link (8 Gbyte/s).
Each node has 10 U.2 drives, so each node can do local read I/O at a maximum of 10 * 7 = 70 Gbyte/s.
However, each node has a network bandwidth of only 200 Gbit/s (2 x 100GbE Mellanox ConnectX-6), which is only 25 Gbyte/s. This implies that remote reads are under-utilizing the drives (capable of 70 Gbyte/s). The network is the bottleneck.
Assuming no additional network bottlenecks (they don't describe the network architecture), this implies the 68 nodes can provide 68 * 25 = 1700 Gbyte/s of network reads. The author benchmarked 1 TiB/s - 1025 GiB/s = 1101 Gbyte/s, to be exact - which is 65% of the theoretical maximum of 1700 Gbyte/s. That's pretty decent, but in theory it could still do a bit better, assuming all nodes can truly saturate their 200 Gbit/s network links concurrently.
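The chain of limits above, written out (all figures taken from the post and the linked spec sheets):

```python
# Per-drive and per-node bandwidth limits (figures from the post).
pcie_lane_gbit = 16                                 # PCIe gen4, per lane
u2_lanes = 4
u2_link_gbyte = pcie_lane_gbit * u2_lanes / 8       # 8 GB/s per U.2 link
drive_read_gbyte = 7                                # drive spec sheet: 7 GB/s
drives_per_node = 10

node_local_read_gbyte = drives_per_node * drive_read_gbyte   # 70 GB/s local
node_network_gbyte = 200 / 8                                 # 2x100GbE = 25 GB/s

# The network, not the drives, caps remote reads.
nodes = 68
cluster_max_gbyte = nodes * node_network_gbyte               # 1700 GB/s
benchmarked_gbyte = 1101                                     # 1025 GiB/s

print(u2_link_gbyte)                                    # 8.0
print(cluster_max_gbyte)                                # 1700.0
print(round(benchmarked_gbyte / cluster_max_gbyte, 2))  # 0.65
```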
Reading this whole blog post, I got the impression Ceph's complexity hits the CPU pretty hard. Not compiling a module with -O2 ("Fix Three", linked by the author: https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1894453) can make things "up to 5x slower with some workloads" (https://bugs.gentoo.org/733316), which is pretty unexpected for a pure I/O workload. Also, what's up with the OSD's threads wasting so much CPU grabbing the IOMMU spinlock? I agree with the conclusion that the OSD threading model is suboptimal. A relatively simple synthetic 100% read benchmark should not expose thread contention if that part of Ceph's software architecture were well designed (it's fixable, so I hope the Ceph devs prioritize it).
I did some work last summer kind of duct taping the OSD's existing threading model (double buffering the hand-off between async msgr and worker threads, adaptive thread wakeup, etc). I could achieve significant performance / efficiency gains under load, but at the expense of increased low-load latency (Ceph by default is very aggressive about waking up threads when new IO arrives for a given shard).
One of the other core developers and I discussed it and we both came to the conclusion that it probably makes sense to do a more thorough rewrite of the threading code.
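The "double buffering the hand-off" idea mentioned above can be sketched in miniature. This is purely illustrative code, nothing like the actual OSD internals: a producer (think: async messenger) appends to a front buffer while a worker swaps and drains batches, so the lock is held only for the swap rather than per item:

```python
import threading

class DoubleBuffer:
    """Toy double-buffered producer/worker hand-off."""

    def __init__(self):
        self.front = []
        self.ready = threading.Condition()

    def submit(self, item):
        with self.ready:
            self.front.append(item)
            self.ready.notify()   # eager wakeup, like Ceph's default behaviour

    def drain(self):
        with self.ready:
            while not self.front:
                self.ready.wait()
            batch, self.front = self.front, []   # swap buffers under the lock
        return batch                             # process the batch lock-free

buf = DoubleBuffer()
for i in range(5):
    buf.submit(i)
assert buf.drain() == [0, 1, 2, 3, 4]
```

The latency trade-off mentioned above shows up in `notify()`: batching wakeups saves CPU under load but delays the first item of a batch when the system is idle.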
At least that's the number I could find. Not exactly tons of reviews of these enterprise NVMe disks...
Still, that seems like a good match to the NICs. At this scale most workloads will likely appear as random IO at the storage layer anyway.
You need to pay attention to the kind of hardware you use, but you can definitely get Ceph down to 0.5-0.6 ms latency on block workloads doing single thread, single queue, sync 4K writes.
Source: I run Ceph at work doing pretty much this.
50th percentile = 1.75 ms
90th percentile = 3.15 ms
99th percentile = 9.54 ms
That's with 700 MB/s of reads and 200 MB/s of writes, or approximately 7000 read IOPS and 9000 write IOPS. Everything has a trade-off, and with Ceph you get a ton of capability, but latency is that trade-off. Databases - depending on requirements - may be better off on regular NVMe rather than on Ceph.
It featured briefly in a Jeff Geerling video recently :-)
[0]: Understanding Ceph: open-source scalable storage https://louwrentius.com/understanding-ceph-open-source-scala...
I would love to see some benchmarks there.
Functionally, Linux implements a file system (well, several!) as well (in addition to many other OS features) -- but (usually!) only on top of local hardware.
There seems to be some missing software here -- if we examine these two paradigms side-by-side.
For example, what if I want a Linux (or more broadly, a general OS) -- but one that doesn't manage a local file system or local storage at all?
One that operates solely using the network, solely using a distributed file system that Ceph, or software like Ceph, would provide?
Conversely, what if I don't want to run a full OS on a network machine, a network node that manages its own local storage?
The only thing I can think of to solve those types of problems -- is:
What if the Linux filesystem layer were written as a completely separate piece of software, the way a distributed file system like Ceph is, not dependent on the rest of the kernel source code (although still compilable into the kernel, as most Linux components normally are)...
A lot of work? Probably!
But there seems to be some software need for something between a solely distributed file system as Ceph is, and a completely monolithic "everything baked in" (but not distributed!) OS/kernel as Linux is...
Note that I am just thinking aloud here -- I probably am wrong and/or misinformed on one or more fronts!
So, kindly take this random "thinking aloud" post -- with the proverbial "grain of salt!" :-)
Linux can boot from NFS although that's kind of lost knowledge. Booting from CephFS might even be possible if you put the right parts in the initrd.
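For reference, the NFS-root case is just kernel command-line parameters plus NFS client support built into the kernel (or the initrd). A sketch with placeholder addresses, per the kernel's nfsroot documentation:

```
# Kernel command line for a diskless NFS-root boot (placeholder addresses).
# Requires CONFIG_NFS_FS and CONFIG_ROOT_NFS, or an initrd that mounts NFS.
root=/dev/nfs nfsroot=192.168.1.10:/srv/nfsroot ip=dhcp rw
```

Booting from CephFS would indeed need the initrd route, since there is no built-in `root=` handler for it.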
The downside is durability and operations - we have to keep Ceph alive and are responsible for making sure the data is persistent. That said, we're storing cache from container builds, so in the worst-case where we lose the storage cluster, we can run builds without cache while we restore.
There was no significant difference when testing between the latest HWE on Ubuntu 20.04 and kernel 6.2 on Ubuntu 22.04. In both cases we ran into the same IOMMU behaviour. Our tooling is all very much catered around Ubuntu so testing newer kernels with other distros just wasn’t feasible in the timescale we had to get this built. The plan was < 2 months from initial design to completion.
Awesome to see this on HN, we’re a pretty under-the-radar operation so there’s not much more I can say but proud to have worked on this!
Note that 36-port 56G switches are dirt cheap on eBay, and 4 Tbps is good enough for most homelab use cases.
but will it be able to handle combined TB/s traffic?
https://www.qct.io/product/index/Switch/Ethernet-Switch/T700...
And then you connect the TOR switches to higher level switches in something like a Clos distribution to get the desired bandwidth between any two nodes:
https://www.techtarget.com/searchnetworking/definition/Clos-...
the switch is fine, I'm buying 64x800G switches, but NIC wise I'm limited to 400Gbit.
As per the blog, the cluster is now in a 6+2 EC configuration for production which gives ~7PiB usable. Expensive yes, but well worth it if this is the scale and performance required.
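The ~7 PiB figure checks out against the hardware numbers from the post (68 nodes, 10 drives of 15.36 TB each, 6+2 erasure coding):

```python
# Usable capacity under 6+2 erasure coding (figures from the post).
nodes = 68
drives_per_node = 10
drive_tb = 15.36                              # decimal terabytes per drive

raw_tb = nodes * drives_per_node * drive_tb   # 10444.8 TB raw
data_fraction = 6 / (6 + 2)                   # 6 data chunks + 2 parity chunks
usable_tb = raw_tb * data_fraction            # 7833.6 TB

usable_pib = usable_tb * 1e12 / 2**50         # convert decimal TB to binary PiB
print(round(usable_pib, 1))                   # ~7.0 PiB
```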
To put it into perspective, there are 68 nodes with 98 hardware threads each, which means roughly 1000/7000 ≈ 140 MB/s per thread, or 280 MB/s per core - and that's not that impressive, to be honest.