The docs and operational tooling feel a bit janky at first, but they get the job done, and the whole project is surprisingly feature-rich. I've dealt with basic power outages, hardware-caused data corruption (cheap old SSDs), etc., and it was always possible to recover.
In some ways I feel like the surprising thing is that there is such a gap in open-source S3-API blob stores. Minio is very simple and great, but is one-file-per-object on disk (great for maybe 90% of use cases, but not for billions of thumbnails). Ceph et al. are quite complex. There are a bunch of almost-sorta-kinda solutions like base64-encoded bytes in HBase/PostgreSQL/etc., or chunking (like MongoDB's GridFS), but really you just want to concatenate the bytes like a .tar file and index in with range requests.
The Wayback Machine's WARC files plus CDX (index files with offset/range) is pretty close.
I'm not trying to start trouble, only raising awareness, because in some environments such a thing matters.
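For illustration, a minimal sketch of the "concatenate like a .tar file and index in with range requests" idea; all names here are hypothetical, and a plain local file stands in for an object store, with `seek()`/`read()` playing the part of an HTTP range request:

```python
import os

class BlobPack:
    """Pack many small blobs into one file; index each by (offset, length)."""

    def __init__(self, path):
        self.path = path
        self.index = {}             # key -> (offset, length), like a CDX index
        open(path, "ab").close()    # make sure the pack file exists

    def put(self, key, data: bytes):
        with open(self.path, "ab") as f:
            f.seek(0, os.SEEK_END)  # appends always land at the end
            offset = f.tell()
            f.write(data)
        self.index[key] = (offset, len(data))

    def get(self, key) -> bytes:
        # Equivalent of "GET ... Range: bytes=offset-(offset+length-1)"
        offset, length = self.index[key]
        with open(self.path, "rb") as f:
            f.seek(offset)
            return f.read(length)
```

The per-blob overhead is just one index entry, which is why this shape works for billions of thumbnails where one-file-per-object does not.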
I was expecting C/C++ or Rust, pleasantly surprised to see Go.
Why would you base64 encode them? They all store binary formats.
It would be great if you could share the names of, or references to, some papers on this. Thank you in advance.
(If any SeaweedFS devs are seeing this, having a section of the wiki that describes failure situations and how to manage them would be a huge add-on.)
SeaweedFS is built on top of a blob storage based on Facebook's Haystack paper. The features are not fully developed yet, but what makes it different is a new way of programming for the cloud era.
When needing some storage, just fallocate some space to write to, and a file_id is returned. Use the file_id similar to a pointer to a memory block.
There will be more features built on top of it. File system and Object store are just a couple of them. Need more help on this.
just fallocate some space to write to, and a file_id is returned. Use the file_id similar to a pointer to a memory block.
How is that not mmap?
Also what is the difference between a file, an object, a blob, a filesystem and an object store? Is all this just files indexed with sql?
The allocated storage is append only. For updates, just allocate another blob. The deleted blobs would be garbage collected later. So it is not really mmap.
> Also what is the difference between a file, an object, a blob, a filesystem and an object store?
The answer would be too long to fit here. Maybe chatgpt can help. :)
> Is all this just files indexed with sql?
Sort of yes.
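A toy model of the flow described above, nothing like the real SeaweedFS code, just the shape: a write allocates a needle and hands back a file_id (the "pointer"), an update is simply a new needle under a new id, a delete is only a tombstone, and GC reclaims the dead space later:

```python
import itertools

class Volume:
    """Append-only blob volume: write returns a file_id, deletes are lazy."""

    def __init__(self):
        self._needles = {}            # needle_id -> bytes
        self._deleted = set()         # tombstones awaiting GC
        self._ids = itertools.count(1)

    def write(self, data: bytes) -> int:
        fid = next(self._ids)         # the file_id handed back to the caller
        self._needles[fid] = data     # append-only: an id is never rewritten
        return fid

    def read(self, fid: int) -> bytes:
        if fid in self._deleted:
            raise KeyError(fid)
        return self._needles[fid]

    def delete(self, fid: int):
        self._deleted.add(fid)        # tombstone only; space not yet reclaimed

    def gc(self):
        for fid in self._deleted:     # later pass actually drops dead needles
            self._needles.pop(fid, None)
        self._deleted.clear()
```

An "update" in this model is `new_fid = v.write(new_data); v.delete(old_fid)`, which is exactly why it is not mmap: nothing is ever modified in place.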
I think SeaweedFS would really benefit from more documentation on what exactly it does.
People who want to deploy production systems need that, and it would also help potential contributors.
Some examples:
* It says "optimised for small files", but it is not super clear from the whitepaper and other documentation what that means. It mostly talks about how small the per-file overhead is, but that's not enough. For example, on Ceph I can also store 500M files without problems, but then later discover that some operations that happen only infrequently, such as recovery or scrubs, are O(files) and thus have O(files) many seeks, which can mean 2 months of seeking for a recovery of 500M files to finish. ("Recovery" here means when a replica fails and the data is copied to another replica.)
* More on small files: Assuming small files are packed somehow to solve the seek problem, what happens if I delete some files in the middle of the pack? Do I get fragmentation (space wasted by holes)? If yes, is there a defragmentation routine?
* One page https://github.com/seaweedfs/seaweedfs/wiki/Replication#writ... says "volumes are append only", which suggests that there will be fragmentation. But here I need to piece together info from different unrelated pages in order to answer a core question about how SeaweedFS works.
* https://github.com/seaweedfs/seaweedfs/wiki/FAQ#why-files-ar... suggests that "vacuum" is the defragmentation process. It says it triggers automatically when deleted-space overhead reaches 30%. But what performance implications does a vacuum have, can it take long and block some data access? This would be the immediate next question any operator would have.
* Scrubs and integrity: It is common for redundant-storage systems (md-RAID, ZFS, Ceph) to detect and recover from bitrot via checksums and cross-replica comparisons. This requires automatic regular inspections of the stored data ("scrubs"). For SeaweedFS, I can find no docs about it, only some Github issues (https://github.com/seaweedfs/seaweedfs/issues?q=scrub) that suggest that there is some script that runs every 17 minutes. But looking at that script, I can't find which command is doing the "repair" action. Note that just having checksums is not enough for preventing bitrot: It helps detect it, but does not guarantee that the target number of replicas is brought back up (as it may take years until you read some data again). For that, regular scrubs are needed.
* Filers: For a production store with a highly-available POSIX FUSE mount, I need to choose a suitable Filer backend. There's a useful page about these at https://github.com/seaweedfs/seaweedfs/wiki/Filer-Stores. But there are many of them, and information is limited to ~8 words per backend. To know how a backend will perform, I need to know both the backend well and how SeaweedFS will use it. I will also be subject to the operational workflows of that backend; e.g., running and upgrading a large HA Postgres is unfortunately not easy. As another example, Postgres itself does not scale beyond a single machine unless one uses something like Citus, and I have no info on whether SeaweedFS will work with that.
* The word "Upgrades" seems generally un-mentioned in Wiki and README. How are forward and backward compatibility handled? Can I just switch SeaweedFS versions forward and backward and expect everything will automatically work? For Ceph there are usually detailed instructions on how one should upgrade a large cluster and its clients.
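To make the fragmentation/vacuum questions above concrete, here is a guess at what a vacuum plausibly looks like; the 30% threshold is the figure from the FAQ, but the copy-forward mechanics are my assumption, not documented SeaweedFS behavior:

```python
def vacuum(needles, deleted, threshold=0.30):
    """Compact a volume when the dead-space ratio passes the threshold.

    needles: dict needle_id -> bytes; deleted: set of tombstoned ids.
    Returns (new_needles, new_deleted). Cost is O(live bytes), because
    live needles are copied forward into a fresh append-only file.
    """
    total = sum(len(v) for v in needles.values())
    dead = sum(len(needles[k]) for k in deleted if k in needles)
    if total == 0 or dead / total < threshold:
        return needles, deleted            # below threshold: do nothing
    # Copy-forward: only live needles survive into the new volume file,
    # so the holes left by deletes disappear and tombstones are dropped.
    compacted = {k: v for k, v in needles.items() if k not in deleted}
    return compacted, set()
```

If this guess is right, the operator-facing cost is proportional to the live data in the volume, which is exactly the kind of statement the docs could make explicit.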
In general the way this should be approached is: Pretend to know nothing about SeaweedFS, and imagine what a user that wants to use it in production wants to know, and what their followup questions would be.
Some parts of that are partially answered in the presentations, but it is difficult to piece together how the software currently works from presentations of different ages (maybe they are already outdated?), and the presentations are also quite light on info (usually only 1 slide per topic). I think the Github Wiki is a good way to do it, but it, too, is too light on information, and I'm not sure it has everything that's in the presentations.
I understand the README already says "more tools and documentation", I just want to highlight how important the "what does it do and how does it behave" part of documentation is for software like this.
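As an illustration of the scrub point above, this is the generic pattern (not SeaweedFS's actual mechanism, which as noted seems undocumented): checksums only detect the rot; the periodic cross-replica pass is what actually restores the target replica count without waiting years for a client read to stumble on the bad bytes:

```python
import zlib

def scrub(replicas):
    """One scrub pass. replicas: list of dicts key -> (data, stored_crc).

    Re-reads every blob on every replica; on a checksum mismatch, repairs
    the bad copy from any replica whose checksum still verifies.
    Returns the number of repairs performed.
    """
    repaired = 0
    for rep in replicas:
        for key, (data, crc) in list(rep.items()):
            if zlib.crc32(data) == crc:
                continue                          # this copy is healthy
            for other in replicas:                # bitrot: find a good copy
                if other is rep:
                    continue
                odata, ocrc = other.get(key, (None, None))
                if odata is not None and zlib.crc32(odata) == ocrc:
                    rep[key] = (odata, ocrc)      # repair from healthy replica
                    repaired += 1
                    break
    return repaired
```

The scheduling of this loop (how often, how throttled, whether it blocks reads) is precisely the operational information the wiki would need to spell out.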
All projects were either cancelled, features cut, or officially left in limbo.
It's a pretty remarkable piece of Microsoft history as it has been there on the sidelines since roughly post-Windows 3.11. The reason they returned to it so often was in part because Bill Gates loved the idea of a more high level object storage that, like this, bridges the gap between files and databases.
He would probably have loved to have this kind of technology be part of Windows -- and indeed in 2013, he cited the failure of WinFS as his greatest disappointment at Microsoft, saying that it was ahead of its time and that it would re-emerge.
Failing to capture any of the mobile handset market while missing out almost entirely on search and social media businesses would be higher on my list if I were in BG's shoes.
Having a bigger presence in mobile and social would have been more lucrative, but from a CS geek point of view, the failure of WinFS might have been more stinging.
MS carved out economic rent from business customers with Windows and Office; Apple actually failed at that.
Seaweed has been running my k8s persistent volumes pretty admirably for like a year for about 4 devs.
My one complaint is that I could not really get it to work with an S3 compatible api that wasn’t officially supported in the list of S3 backends, even though that should have been theoretically possible. I ended up picking a supported backend instead.
So far I have been testing with Seaweed, and it seems to chug along fine at around ~4B files, still increasing.
Has Minio improved on that lately?
https://garagehq.deuxfleurs.fr/
I use it for self hosting.
What we need and haven't identified yet is an SDS system that provides both fully-compliant POSIX FS and S3 volumes, is FOSS, has a production story where individuals can do all tasks competently/quickly/effectively (management, monitoring, disaster recovery incl. erasure coding and tooling), and has CSI drivers that work with Nomad.
This rules out Ceph and friends. GarageFS, also mentioned in this thread, is S3 only. We went through everything applicable on the K8S drivers list https://kubernetes-csi.github.io/docs/drivers.html except for Minio, because it claimed it needed a backing store anyways (like Ceph) although just a few days ago I encountered Minio being used standalone.
While I'm on this topic, I noticed that the CSI drivers (SeaweedFS's, and SDS ones in general) use tremendous resources when effecting mounts, instantiating nearly a whole OS (an OCI image with a functional userland) just to mount what appears to be NFS or something.
A key point was to effectively treat S3 as a huge, reliable disk with 10MB "sectors". So the bucket would contain tons of 10MB chunks and ZFS would let S3 handle the redundancy. For performance it was coupled with a large, local SSD-based write-back cache.
Sadly it seems the company behind this figured it needed to keep this closed-source in order to get ROI[2].
I've never used it myself and just learned about it from this thread but it seems to fit the bill.
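The sector mapping behind the "S3 as a huge, reliable disk with 10MB sectors" idea is simple to sketch; the chunk-naming scheme and sizes here are illustrative, not the actual product's layout:

```python
CHUNK = 10 * 1024 * 1024   # 10 MB "sectors", per the description above

def chunks_for(offset, length):
    """Which chunk objects an I/O at (offset, length) touches.

    A block device's offset space is mapped onto fixed-size S3 objects,
    so any read or write only ever fetches/rewrites the chunks it
    overlaps; S3's own replication handles redundancy for each chunk.
    """
    first = offset // CHUNK
    last = (offset + length - 1) // CHUNK
    return [f"chunk-{n:08d}" for n in range(first, last + 1)]
```

A local SSD write-back cache then absorbs the latency of rewriting a whole 10MB object for small writes, which is the performance trick the comment mentions.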
Seaweed had much better performance for our use case.
But with MinIO and erasure coding a single PUT results in more IOPS and we saw lower performance.
Also, expanding MinIO must be done in increments of your original buildout which is annoying. So if you start with 4 servers and 500TB, they recommend you expand by adding another 4 servers with 500TB at least.
If you could figure out the distributed part (and inconsistency in disk size and such), then this is a very nice system to have.
I use ZFS for most of my things, but I have yet to find a good way of just sharing a ZFS dataset over S3.
Drop-in S3 compatibility with much better performance would be insane.
We are only a couple months in and haven’t had to add to our cluster yet, storing about 250TB, so it’s still early for us. Promising so far and hardware has already paid for itself.
1) Are you _really_ sure you need it distributed, or can you shard it yourself? (Hint: distributed anything sucks up at least one if not two innovation tokens; if you're using other innovation tokens as well, you're going to have a very bad time.)
2) Do you need to modify blobs, or can you get away with read/modify/replace? (S3 doesn't support partial writes; a one-bit change requires the whole file to be rewritten.)
3) What's your ratio of reads to writes? (Do you need local caches, or local pools in GPFS parlance?)
4) How much are you going to change the metadata? (If there's POSIX somewhere, it'll be a lot.)
5) Are you going to try to write to the same object at the same time in two different locations? (How do you manage locking and concurrency?)
6) Do you care about availability, consistency, or speed? (Pick one, maybe one and a half.)
7) How are you going to recover from the distributed storage shitting itself all at the same time?
8) How are you going to control access?
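Point 2 above is worth seeing concretely: with an S3-style store there is no partial write, so flipping even one byte means GET the whole object, patch it in memory, and PUT the whole thing back. A dict stands in for the object store here:

```python
def patch_byte(store, key, pos, value):
    """Read/modify/replace: the only way to 'edit' an S3-style object."""
    data = bytearray(store[key])   # GET: the full object comes down
    data[pos] = value              # the one-byte change
    store[key] = bytes(data)       # PUT: the full object goes back up
```

For a 5GB object, that one-byte edit costs 5GB down and 5GB up, which is why "can you get away with read/modify/replace?" is the right screening question.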
2) No modifications, just new files and the occasional deletion request.
3) Almost just 1 write and 1 read per file, this is a backing storage for the source files, and they are cached in front.
4) Never
5) Files are written only by one other server, and there will be no parallel writes.
6) I pick consistency and as the half, availability.
7) This happened something like 15 years ago with MogileFS and thus scared us away. (Hence the single-server ZFS setup).
8) Reads are public, writes restricted to one other service that may write.
_early_ Lustre (it's much better now)
GPFS
Gluster (fuck that)
clustered XFS (double fuck that)
Isilon
Nowadays, a single 2U server can realistically support 2x 100-gig NICs at full bore. So the biggest barrier is density. You can probably get 1PB in a rack now, and linking a bunch of JBODs (well, NVMes) is probably easy to do now.
[2] https://dzone.com/articles/seaweedfs-vs-juicefs-in-design-an...
[0] https://juicefs.com/docs/community/reference/how_to_set_up_o...
(This is HPC work with large processing pipelines. I keep track of whether a job was successful based on whether the final file exists. The rename only happens if the job was successful. It's a great way to track pipeline status, but metadata lookups can be a pain, particularly for missing files.)
One similar use case used Cassandra as SeaweedFS filer store, and created thousands of files per second in a temp folder, and moved the files to a final folder. It caused a lot of tombstones for the updates in Cassandra.
Later, they changed to use Redis for the temp folder, and keep Cassandra for other folders. Everything has been very smooth since then.
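The fix described above amounts to routing filer metadata by path prefix so the churny temp folder lands in Redis (no tombstone problem) while long-lived paths stay in Cassandra. A hedged sketch, with both backends stubbed as dicts and the `/temp/` prefix chosen for illustration:

```python
class RoutedFiler:
    """Route file metadata to a backend based on path prefix."""

    def __init__(self):
        self.redis = {}      # stand-in for the Redis-backed filer store
        self.cassandra = {}  # stand-in for the Cassandra-backed filer store

    def _store(self, path):
        # High-churn temp paths go to Redis; everything else to Cassandra.
        return self.redis if path.startswith("/temp/") else self.cassandra

    def put(self, path, meta):
        self._store(path)[path] = meta

    def rename(self, src, dst):
        # The delete side of the rename hits only the temp store, so
        # Cassandra never accumulates tombstones for the churn.
        meta = self._store(src).pop(src)
        self._store(dst)[dst] = meta
```

With this split, the thousands-per-second create/move workload only ever deletes from Redis, which handles overwrite/delete churn without Cassandra's tombstone penalty.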
Otherwise: egress costs.
I think since v0.7. I was always intrigued by Facebook's Haystack.
SeaweedFS has been super reliable, efficient, and trouble-free.
Now I only need to wait 10 years until all the hidden but crucial bugs are found (at the cost of massive loss of real data, of course) before I'm ready to use it, like with every new piece of technology...
Or what should give me the confidence that it isn't so?