The docs and operational tooling feel a bit janky at first, but they get the job done, and the whole project is surprisingly feature-rich. I've dealt with basic power outages, hardware-caused data corruption (cheap old SSDs), etc., and it was always possible to recover.
In some ways I feel like the surprising thing is that there is such a gap in open-source S3-API blob stores. Minio is very simple and great, but is one-file-per-object on disk (great for maybe 90% of use cases, but not for billions of thumbnails). Ceph et al. are quite complex. There are a bunch of almost-sorta-kinda solutions like base64-encoded bytes in HBase/PostgreSQL/etc., or chunking (like MongoDB's GridFS), but really you just want to concatenate the bytes like a .tar file and index in with range requests.
The Wayback Machine's WARC files plus CDX (index files with offset/range) is pretty close.
I'm not trying to start trouble, only raising awareness, because in some environments such a thing matters.
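For illustration, a minimal sketch of the "concatenate like a .tar file and index in with range requests" idea; all names here are hypothetical, and a plain local file stands in for an object store, with `seek()`/`read()` playing the part of an HTTP range request:

```python
import os

class BlobPack:
    """Pack many small blobs into one file; index each by (offset, length)."""

    def __init__(self, path):
        self.path = path
        self.index = {}             # key -> (offset, length), like a CDX index
        open(path, "ab").close()    # make sure the pack file exists

    def put(self, key, data: bytes):
        with open(self.path, "ab") as f:
            f.seek(0, os.SEEK_END)  # appends always land at the end
            offset = f.tell()
            f.write(data)
        self.index[key] = (offset, len(data))

    def get(self, key) -> bytes:
        # Equivalent of "GET ... Range: bytes=offset-(offset+length-1)"
        offset, length = self.index[key]
        with open(self.path, "rb") as f:
            f.seek(offset)
            return f.read(length)
```

The per-blob overhead is just one index entry, which is why this shape works for billions of thumbnails where one-file-per-object does not.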
I was expecting C/C++ or Rust, pleasantly surprised to see Go.
Why would you base64 encode them? They all store binary formats.
It would be great if you could share the names of, or references to, some papers on this. Thank you in advance.
(If any SeaweedFS devs are seeing this, having a section of the wiki that describes failure situations and how to manage them would be a huge add-on.)
SeaweedFS is built on top of a blob storage based on Facebook's Haystack paper. The features are not fully developed yet, but what makes it different is a new way of programming for the cloud era.
When needing some storage, just fallocate some space to write to, and a file_id is returned. Use the file_id similar to a pointer to a memory block.
There will be more features built on top of it. File system and Object store are just a couple of them. Need more help on this.
just fallocate some space to write to, and a file_id is returned. Use the file_id similar to a pointer to a memory block.
How is that not mmap?
Also what is the difference between a file, an object, a blob, a filesystem and an object store? Is all this just files indexed with sql?
The allocated storage is append only. For updates, just allocate another blob. The deleted blobs would be garbage collected later. So it is not really mmap.
> Also what is the difference between a file, an object, a blob, a filesystem and an object store?
The answer would be too long to fit here. Maybe chatgpt can help. :)
> Is all this just files indexed with sql?
Sort of yes.
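A toy model of the flow described above, nothing like the real SeaweedFS code, just the shape: a write allocates a needle and hands back a file_id (the "pointer"), an update is simply a new needle under a new id, a delete is only a tombstone, and GC reclaims the dead space later:

```python
import itertools

class Volume:
    """Append-only blob volume: write returns a file_id, deletes are lazy."""

    def __init__(self):
        self._needles = {}            # needle_id -> bytes
        self._deleted = set()         # tombstones awaiting GC
        self._ids = itertools.count(1)

    def write(self, data: bytes) -> int:
        fid = next(self._ids)         # the file_id handed back to the caller
        self._needles[fid] = data     # append-only: an id is never rewritten
        return fid

    def read(self, fid: int) -> bytes:
        if fid in self._deleted:
            raise KeyError(fid)
        return self._needles[fid]

    def delete(self, fid: int):
        self._deleted.add(fid)        # tombstone only; space not yet reclaimed

    def gc(self):
        for fid in self._deleted:     # later pass actually drops dead needles
            self._needles.pop(fid, None)
        self._deleted.clear()
```

An "update" in this model is `new_fid = v.write(new_data); v.delete(old_fid)`, which is exactly why it is not mmap: nothing is ever modified in place.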
I think SeaweedFS would really benefit from more documentation on what exactly it does.
People who want to deploy production systems need that, and it would also help potential contributors.
Some examples:
* It says "optimised for small files", but it is not super clear from the whitepaper and other documentation what that means. It mostly talks about how small the per-file overhead is, but that's not enough. For example, on Ceph I can also store 500M files without problems, but then later discover that some operations that happen only infrequently, such as recovery or scrubs, are O(files) and thus have O(files) many seeks, which can mean 2 months of seeking for a recovery of 500M files to finish. ("Recovery" here means when a replica fails and the data is copied to another replica.)
* More on small files: Assuming small files are packed somehow to solve the seek problem, what happens if I delete some files in the middle of the pack? Do I get fragmentation (space wasted by holes)? If yes, is there a defragmentation routine?
* One page https://github.com/seaweedfs/seaweedfs/wiki/Replication#writ... says "volumes are append only", which suggests that there will be fragmentation. But here I need to piece together info from different unrelated pages in order to answer a core question about how SeaweedFS works.
* https://github.com/seaweedfs/seaweedfs/wiki/FAQ#why-files-ar... suggests that "vacuum" is the defragmentation process. It says it triggers automatically when deleted-space overhead reaches 30%. But what performance implications does a vacuum have, can it take long and block some data access? This would be the immediate next question any operator would have.
* Scrubs and integrity: It is common for redundant-storage systems (md-RAID, ZFS, Ceph) to detect and recover from bitrot via checksums and cross-replica comparisons. This requires automatic regular inspections of the stored data ("scrubs"). For SeaweedFS, I can find no docs about it, only some Github issues (https://github.com/seaweedfs/seaweedfs/issues?q=scrub) that suggest that there is some script that runs every 17 minutes. But looking at that script, I can't find which command is doing the "repair" action. Note that just having checksums is not enough for preventing bitrot: It helps detect it, but does not guarantee that the target number of replicas is brought back up (as it may take years until you read some data again). For that, regular scrubs are needed.
* Filers: For a production store with a highly-available POSIX FUSE mount, I need to choose a suitable Filer backend. There's a useful page about these at https://github.com/seaweedfs/seaweedfs/wiki/Filer-Stores. But there are many of them, and information is limited to ~8 words per backend. To know how a backend will perform, I need to know both the backend well and how SeaweedFS will use it. I will also be subject to the operational workflows of that backend; e.g., running and upgrading a large HA Postgres is unfortunately not easy. As another example, Postgres itself does not scale beyond a single machine unless one uses something like Citus, and I have no info on whether SeaweedFS will work with that.
* The word "Upgrades" seems generally un-mentioned in Wiki and README. How are forward and backward compatibility handled? Can I just switch SeaweedFS versions forward and backward and expect everything will automatically work? For Ceph there are usually detailed instructions on how one should upgrade a large cluster and its clients.
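To make the fragmentation/vacuum questions above concrete, here is a guess at what a vacuum plausibly looks like; the 30% threshold is the figure from the FAQ, but the copy-forward mechanics are my assumption, not documented SeaweedFS behavior:

```python
def vacuum(needles, deleted, threshold=0.30):
    """Compact a volume when the dead-space ratio passes the threshold.

    needles: dict needle_id -> bytes; deleted: set of tombstoned ids.
    Returns (new_needles, new_deleted). Cost is O(live bytes), because
    live needles are copied forward into a fresh append-only file.
    """
    total = sum(len(v) for v in needles.values())
    dead = sum(len(needles[k]) for k in deleted if k in needles)
    if total == 0 or dead / total < threshold:
        return needles, deleted            # below threshold: do nothing
    # Copy-forward: only live needles survive into the new volume file,
    # so the holes left by deletes disappear and tombstones are dropped.
    compacted = {k: v for k, v in needles.items() if k not in deleted}
    return compacted, set()
```

If this guess is right, the operator-facing cost is proportional to the live data in the volume, which is exactly the kind of statement the docs could make explicit.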
In general the way this should be approached is: Pretend to know nothing about SeaweedFS, and imagine what a user that wants to use it in production wants to know, and what their followup questions would be.
Some parts of that are partially answered in the presentations, but it is difficult to piece together how the software currently works from presentations of different ages (maybe they are already outdated?), and the presentations are also quite light on info (usually only 1 slide per topic). I think the Github Wiki is a good way to do it, but it, too, is too light on information, and I'm not sure it has everything that's in the presentations.
I understand the README already says "more tools and documentation", I just want to highlight how important the "what does it do and how does it behave" part of documentation is for software like this.
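As an illustration of the scrub point above, this is the generic pattern (not SeaweedFS's actual mechanism, which as noted seems undocumented): checksums only detect the rot; the periodic cross-replica pass is what actually restores the target replica count without waiting years for a client read to stumble on the bad bytes:

```python
import zlib

def scrub(replicas):
    """One scrub pass. replicas: list of dicts key -> (data, stored_crc).

    Re-reads every blob on every replica; on a checksum mismatch, repairs
    the bad copy from any replica whose checksum still verifies.
    Returns the number of repairs performed.
    """
    repaired = 0
    for rep in replicas:
        for key, (data, crc) in list(rep.items()):
            if zlib.crc32(data) == crc:
                continue                          # this copy is healthy
            for other in replicas:                # bitrot: find a good copy
                if other is rep:
                    continue
                odata, ocrc = other.get(key, (None, None))
                if odata is not None and zlib.crc32(odata) == ocrc:
                    rep[key] = (odata, ocrc)      # repair from healthy replica
                    repaired += 1
                    break
    return repaired
```

The scheduling of this loop (how often, how throttled, whether it blocks reads) is precisely the operational information the wiki would need to spell out.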
All projects were either cancelled, features cut, or officially left in limbo.
It's a pretty remarkable piece of Microsoft history as it has been there on the sidelines since roughly post-Windows 3.11. The reason they returned to it so often was in part because Bill Gates loved the idea of a more high level object storage that, like this, bridges the gap between files and databases.
He would probably have loved to have this kind of technology be part of Windows -- and indeed in 2013, he cited the failure of WinFS as his greatest disappointment at Microsoft, saying that it was ahead of its time and that it would re-emerge.
Failing to capture any of the mobile handset market while missing out almost entirely on search and social media businesses would be higher on my list if I were in BG's shoes.
Having a bigger presence in mobile and social would have been more lucrative, but from a CS geek point of view, the failure of WinFS might have been more stinging.
MS carved out economic rent from business customers with Windows and Office; Apple actually failed at that.
Seaweed has been running my k8s persistent volumes pretty admirably for like a year for about 4 devs.
My one complaint is that I could not really get it to work with an S3 compatible api that wasn’t officially supported in the list of S3 backends, even though that should have been theoretically possible. I ended up picking a supported backend instead.
So far I have been testing with Seaweed, and it seems to chug along fine at around ~4B files, still increasing.
Has Minio improved on that lately?
https://garagehq.deuxfleurs.fr/
I use it for self hosting.
What we need and haven't identified yet is an SDS system that provides both fully-compliant POSIX FS and S3 volumes, is FOSS, has a production story where individuals can do all tasks competently/quickly/effectively (management, monitoring, disaster recovery incl. erasure coding and tooling), and has CSI drivers that work with Nomad.
This rules out Ceph and friends. GarageFS, also mentioned in this thread, is S3 only. We went through everything applicable on the K8S drivers list https://kubernetes-csi.github.io/docs/drivers.html except for Minio, because it claimed it needed a backing store anyways (like Ceph) although just a few days ago I encountered Minio being used standalone.
While I'm on this topic, I noticed that the CSI drivers (SeaweedFS's, and SDS ones in general) use tremendous resources when effecting mounts, instantiating nearly a whole OS (an OCI image with a functional userland) just to mount what appears to be NFS or something.
A key point was to effectively treat S3 as a huge, reliable disk with 10MB "sectors". So the bucket would contain tons of 10MB chunks and ZFS would let S3 handle the redundancy. For performance it was coupled with a large, local SSD-based write-back cache.
Sadly it seems the company behind this figured it needed to keep this closed-source in order to get ROI[2].
I've never used it myself and just learned about it from this thread but it seems to fit the bill.
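The sector mapping behind the "S3 as a huge, reliable disk with 10MB sectors" idea is simple to sketch; the chunk-naming scheme and sizes here are illustrative, not the actual product's layout:

```python
CHUNK = 10 * 1024 * 1024   # 10 MB "sectors", per the description above

def chunks_for(offset, length):
    """Which chunk objects an I/O at (offset, length) touches.

    A block device's offset space is mapped onto fixed-size S3 objects,
    so any read or write only ever fetches/rewrites the chunks it
    overlaps; S3's own replication handles redundancy for each chunk.
    """
    first = offset // CHUNK
    last = (offset + length - 1) // CHUNK
    return [f"chunk-{n:08d}" for n in range(first, last + 1)]
```

A local SSD write-back cache then absorbs the latency of rewriting a whole 10MB object for small writes, which is the performance trick the comment mentions.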
Seaweed had much better performance for our use case.
But with MinIO and erasure coding a single PUT results in more IOPS and we saw lower performance.
Also, expanding MinIO must be done in increments of your original buildout which is annoying. So if you start with 4 servers and 500TB, they recommend you expand by adding another 4 servers with 500TB at least.
If you could figure out the distributed part (and inconsistency in disk size and such), then this is a very nice system to have.
I use ZFS for most of my things, but I have yet to find a good way of just sharing a ZFS dataset over S3.
Drop-in S3 compatibility with much better performance would be insane.
We are only a couple months in and haven’t had to add to our cluster yet, storing about 250TB, so it’s still early for us. Promising so far and hardware has already paid for itself.
1) Are you _really_ sure you need it distributed, or can you shard it yourself? (Hint: distributed anything sucks up at least one if not two innovation tokens; if you're using other innovation tokens as well, you're going to have a very bad time.)
2) Do you need to modify blobs, or can you get away with read/modify/replace? (S3 doesn't support partial writes; a one-bit change requires the whole file to be rewritten.)
3) What's your ratio of reads to writes? (Do you need local caches, or local pools in GPFS parlance?)
4) How much are you going to change the metadata? (If there's POSIX somewhere, it'll be a lot.)
5) Are you going to try to write to the same object at the same time in two different locations? (How do you manage locking and concurrency?)
6) Do you care about availability, consistency, or speed? (Pick one, maybe one and a half.)
7) How are you going to recover from the distributed storage shitting itself all at the same time?
8) How are you going to control access?
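Point 2 above is worth seeing concretely: with an S3-style store there is no partial write, so flipping even one byte means GET the whole object, patch it in memory, and PUT the whole thing back. A dict stands in for the object store here:

```python
def patch_byte(store, key, pos, value):
    """Read/modify/replace: the only way to 'edit' an S3-style object."""
    data = bytearray(store[key])   # GET: the full object comes down
    data[pos] = value              # the one-byte change
    store[key] = bytes(data)       # PUT: the full object goes back up
```

For a 5GB object, that one-byte edit costs 5GB down and 5GB up, which is why "can you get away with read/modify/replace?" is the right screening question.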
2) No modifications, just new files and the occasional deletion request.
3) Almost just 1 write and 1 read per file, this is a backing storage for the source files, and they are cached in front.
4) Never
5) Files are written only by one other server, and there will be no parallel writes.
6) I pick consistency and as the half, availability.
7) This happened something like 15 years ago with MogileFS and thus scared us away. (Hence the single-server ZFS setup).
8) Reads are public, writes restricted to one other service that may write.
_early_ Lustre (it's much better now)
GPFS
Gluster (fuck that)
clustered XFS (double fuck that)
Isilon
Nowadays, a single 2U server can realistically support 2x 100-gig NICs at full bore. So the biggest barrier is density. You can probably get 1PB in a rack now, and linking a bunch of JBODs (well, NVMes) is probably easy to do now.
[2] https://dzone.com/articles/seaweedfs-vs-juicefs-in-design-an...
[0] https://juicefs.com/docs/community/reference/how_to_set_up_o...
(This is HPC work with large processing pipelines. I keep track of whether a job was successful based on whether the final file exists. The rename only happens if the job was successful. It's a great way to track pipeline status, but metadata lookups can be a pain, particularly for missing files.)
One similar use case used Cassandra as SeaweedFS filer store, and created thousands of files per second in a temp folder, and moved the files to a final folder. It caused a lot of tombstones for the updates in Cassandra.
Later, they changed to use Redis for the temp folder, and keep Cassandra for other folders. Everything has been very smooth since then.
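The fix described above amounts to routing filer metadata by path prefix so the churny temp folder lands in Redis (no tombstone problem) while long-lived paths stay in Cassandra. A hedged sketch, with both backends stubbed as dicts and the `/temp/` prefix chosen for illustration:

```python
class RoutedFiler:
    """Route file metadata to a backend based on path prefix."""

    def __init__(self):
        self.redis = {}      # stand-in for the Redis-backed filer store
        self.cassandra = {}  # stand-in for the Cassandra-backed filer store

    def _store(self, path):
        # High-churn temp paths go to Redis; everything else to Cassandra.
        return self.redis if path.startswith("/temp/") else self.cassandra

    def put(self, path, meta):
        self._store(path)[path] = meta

    def rename(self, src, dst):
        # The delete side of the rename hits only the temp store, so
        # Cassandra never accumulates tombstones for the churn.
        meta = self._store(src).pop(src)
        self._store(dst)[dst] = meta
```

With this split, the thousands-per-second create/move workload only ever deletes from Redis, which handles overwrite/delete churn without Cassandra's tombstone penalty.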
Otherwise: egress costs.
I think since v0.7. I was always intrigued by Facebook's Haystack.
SeaweedFS has been super reliable, efficient, and trouble-free.
Now I only need to wait 10 years until all the hidden but crucial bugs are found (at the cost of massive loss of real data, of course) before I'm ready to use it, like with every new piece of technology...
Or what should give me the confidence that it isn't so?