— All writes cost $0.06/GB, since everything is first written to the EFS cache. For write-heavy applications, this could be a dealbreaker.
— Reads hitting the cache get billed at $0.03/GB. Large reads (>128kB) get directly streamed from the underlying S3 bucket, which is free.
— Cache is charged at $0.30/GB/month. Even though everything is written to the cache (for consistency purposes), it seems like it's only used for persistent storage of small files (<128kB), so this shouldn't cost too much.
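To make those three numbers concrete, here is a rough back-of-the-envelope sketch. It assumes the prices quoted above are accurate, treats large reads streamed from S3 as free (as that comment claims), and ignores ordinary S3 storage, request, and data-transfer charges. The function and parameter names are made up for illustration.

```python
# Prices as quoted in the comment above (USD). These are assumptions taken
# from the thread, not verified against the official price list.
WRITE_PER_GB = 0.06          # every write goes through the EFS cache
CACHED_READ_PER_GB = 0.03    # reads served from the cache
CACHE_PER_GB_MONTH = 0.30    # data resident in the cache


def estimate_monthly_cost(written_gb, cached_read_gb, cache_resident_gb):
    """Return a per-item and total monthly estimate in USD.

    Large reads (>128kB) streamed straight from S3 are modeled as free,
    following the comment above; regular S3 charges are not modeled.
    """
    writes = written_gb * WRITE_PER_GB
    cached_reads = cached_read_gb * CACHED_READ_PER_GB
    cache_storage = cache_resident_gb * CACHE_PER_GB_MONTH
    return {
        "writes": writes,
        "cached_reads": cached_reads,
        "cache_storage": cache_storage,
        "total": writes + cached_reads + cache_storage,
    }


if __name__ == "__main__":
    # Hypothetical workload: 1 TB written, 500 GB of cached reads,
    # 20 GB of small files living in the cache.
    for item, usd in estimate_monthly_cost(1000, 500, 20).items():
        print(f"{item}: ${usd:.2f}")
```

With that hypothetical workload the write charge dominates, which is the "dealbreaker for write-heavy applications" point in numbers.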
AWS S3FS uses the normal FUSE interface, which is very heavy because of the inherent overhead of copying data back and forth between user space and kernel space. That was our initial concern when we tried to add POSIX support to our original object storage design. Fortunately, we have found and open-sourced a perfect solution [2]: using FUSE_OVER_IO_URING + FUSE_PASSTHROUGH, we can keep the same high-performance architecture as our original object storage. We'd like to publish a new blog post explaining more details and revealing our performance numbers if anyone is interested in this.
[1] https://fractalbits.com/blog/why-we-built-another-object-sto...
Always uncached? S3 has pretty bad latency.
No reads from S3 are free. All outgoing traffic from AWS is charged no matter what.
The hardest part in building a distributed filesystem is atomic rename. It's always rename. Scalable metadata file systems, like Colossus/Tectonic/ADLSv2/HopsFS, are either designed around how to make rename work at scale* or how to work around it at higher levels in the stack.
* https://www.hopsworks.ai/post/scalable-metadata-the-new-bree...
[1] https://github.com/fractalbits-labs/fractalbits-main/tree/ma...
https://github.com/fractalbits-labs/fractalbits-main/graphs/...
https://docs.cloud.google.com/storage/docs/hns-overview#feat...
That's one way to do it.
> When you create or modify files, changes are aggregated and committed back to S3 roughly every 60 seconds as a single PUT. Sync runs in both directions, so when other applications modify objects in the bucket, S3 Files automatically spots those modifications and reflects them in the filesystem view automatically.
That sounds about right given the above. I have trouble seeing this as something other than a giant "hack." I already don't enjoy projecting costs for new types of S3 access patterns, and I feel like this has the potential to double the complication I already experience here.
Maybe I'm too frugal, but I've been in the cloud for a decade now, and I've worked very hard to prevent any "surprise" bills from showing up. This seems like a great feature, if you don't care what your AWS bill is each month.
You can write into one and read out from the other and vice versa. Consistency guarantees kept within each but not between.
> For example, suppose you edit /mnt/s3files/report.csv through the file system. Before S3 Files synchronizes your changes back to the S3 bucket, another application uploads a new version of report.csv directly to the S3 bucket. When S3 Files detects the conflict, it moves your version of report.csv to the lost and found directory and replaces it with the version from the S3 bucket.
> The lost and found directory is located in your file system's root directory under the name .s3files-lost+found-file-system-id.
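A toy model of the conflict rule quoted above, in case it helps to see it spelled out. Everything here is hypothetical: plain dicts stand in for the file system view, the bucket, and the lost and found directory, and `last_synced` is an invented way to detect "changed on both sides". It says nothing about how S3 Files is actually implemented, and deletes are not modeled.

```python
def sync_path(path, fs, bucket, last_synced, lost_and_found):
    """Reconcile one path between a file system view and a bucket.

    fs, bucket, last_synced and lost_and_found are dicts mapping
    path -> bytes. Returns a short string describing what happened.
    """
    base = last_synced.get(path)
    local_changed = fs.get(path) != base
    remote_changed = bucket.get(path) != base

    if local_changed and remote_changed:
        # Quoted rule: the file system version goes to lost and found,
        # and the bucket version replaces it in the file system view.
        lost_and_found[path] = fs[path]
        fs[path] = bucket[path]
        outcome = "conflict"
    elif local_changed:
        bucket[path] = fs[path]   # push the local edit to the bucket
        outcome = "pushed"
    elif remote_changed:
        fs[path] = bucket[path]   # reflect the bucket change locally
        outcome = "pulled"
    else:
        outcome = "unchanged"

    last_synced[path] = fs[path]
    return outcome


if __name__ == "__main__":
    fs = {"report.csv": b"edited via mount"}
    bucket = {"report.csv": b"uploaded directly"}
    last_synced = {"report.csv": b"original"}
    lost_and_found = {}
    print(sync_path("report.csv", fs, bucket, last_synced, lost_and_found))
    print(fs, lost_and_found)
```

The point to notice is that the bucket always wins, so an application writing through the mount has to be prepared to find its edits in lost and found rather than in place.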
Would
mkfs.ext4 /dev/nvme0n1 && \
mount /dev/nvme0n1 /var/cache/fscache && \
mount -t s3files -o fsc fs-0aa860d05df9afdfe:/ /home/ec2-user/s3files
work out of the box? It does for EFS. It hardly seems worth it to offer a managed service that's effectively three shell commands, but this is AWS we're talking about.
> Don't use the following mount options:
> - fsc – This option enables local file caching, but does not change NFS cache coherency, and does not reduce latencies.
If the S3 Files sync logic ran client-side, we could almost entirely avoid file access latency for cached files and paying for new expensive EFS disks. I already pay for a lot of NVMe disks, let me just use those!
That's true for any NFS setup, not just EFS. The benefit of local NFS caching is to speed up reads of large, immutable files, where latency is relatively negligible. I'm not sure why AWS specifically dissuades users from enabling caching, since it's not like bandwidth to an EFS volume is even in the ballpark of EBS/NVMe bandwidth.
Immutable files can be solved by chunking them, allowing files to be opened and appended to - we do this in HopsFS. However, random writes are typically not supported in scaleout metadata file systems - but rarely used by POSIX clients, thankfully.
If they ever ship in-place writes I'd want to see what happens to the consistency model and pricing first. That's where the actual simplicity lived, not in the API surface. Half the appeal of S3 over a real filesystem was that you couldn't shoot yourself in the foot with partial overwrites.
Built in cache, CDN compatible, JSON metadata, concurrency safe and it targets all S3 compatible storage systems.
Looks like they went back to a simpler solution they could deliver but with some obvious warts. Good to see something get launched, but the sausage making here was brutal.
The reality is that if you read https://www.allthingsdistributed.com/2026/04/s3-files-and-th..., it sounds like the great minds at S3 figured out that a caching layer was the way to go. We (EFS) fucking proposed that years ago. But we had to deal with Seattle and the S3 braintrust who didn't want to do that. I know we wrote a PRFAQ that was close to this concept probably four years ago. The political story is that EFS was taken over by S3 and the EFS folks didn't have the agency or political backing to build a more workable solution. So we wasted a shit ton of time tackling something that was never going to work and many of the tenured EFS engineers left.
Obviously not the same, but at home I am running a Raspberry Pi with s3fs mounting my personal S3 bucket. I am exposing the same directory with /etc/exports (NFS). Which also allows me to use filesystem-caching as a bonus on the client side.
On the other hand, I should probably move out from S3 and use R2 or something...
Having been a fan of S3 for such a long time, I'm really a fan of the design. It's a good compromise and kudos to whoever managed to push through the design.
So there has always been pressure on AWS to make it work like that. I suspect the number of support tickets AWS receives along the lines of "My S3-backed project is slow/fails sometimes/runs into AWS limits (like the max number of buckets per account)", plus the "Why don't..." questions in the design phase, when AWS people are often in the room, has applied enough pressure for long enough to overcome the technical limitations of S3.
I'm not a fan of this type of "let's put a fresh coat on top of it and pretend it's something that it fundamentally is not" abstraction. But I suspect this is a case of social pressure turbocharged by $$$.
https://gist.github.com/huksley/44341276d7c269f092e10784959e...
You might want to play with memory params for GeeseFS for better results
Single PUT per file I assume?
My guess is this would only enable a read-replica and not backups as Litestream currently does?
I don't know if S3 Files implements fcntl() locking or, if it does, whether it does so correctly. But if it does, I believe SQLite should work on it correctly as well.
There have been many buggy NFS locking or caching implementations historically, which is why SQLite recommends against using it on NFS concurrently on multiple machines: https://sqlite.org/faq.html#:~:text=But%20use,time%2E
This SO reply suggests NFSv4 is better at this: https://unix.stackexchange.com/a/432519. But caveat it with this older reply: https://unix.stackexchange.com/a/1887
To the best of my knowledge (I worked a little on this long ago), on Linux even NFSv2 has done correct fcntl() locking for decades, if all the correct services are running and the options are set appropriately and it's Linux on both the client and server. But if something is not configured as it should be, then locking or caching may not work correctly.
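For anyone who wants to poke at this on their own mount, a minimal probe using Python's standard `fcntl` module is below. It only checks whether a non-blocking exclusive fcntl() lock is granted on a given path. The helper name is made up, and a lock being granted on one machine does not prove that locking is coherent across machines; for that you would hold the lock on one host and run the probe from another.

```python
import fcntl
import os


def can_lock_exclusively(path):
    """Try a non-blocking exclusive fcntl() lock on path.

    Returns True if the lock was granted (and releases it again),
    False if the lock call failed, e.g. because another process holds
    a conflicting lock or the file system does not support locking.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        try:
            # lockf() in Python's fcntl module is implemented with fcntl()
            # record locks, the same kind SQLite relies on under Unix.
            fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError:
            return False
        fcntl.lockf(fd, fcntl.LOCK_UN)
        return True
    finally:
        os.close(fd)


if __name__ == "__main__":
    import sys

    # Pass a path on the mount you want to test; defaults to a local file.
    target = sys.argv[1] if len(sys.argv) > 1 else "lock-probe.tmp"
    print(target, "lockable:", can_lock_exclusively(target))
```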
From https://sqlite.org/wal.html
> All processes using a database must be on the same host computer; WAL does not work over a network filesystem. This is because WAL requires all processes to share a small amount of memory and processes on separate host machines obviously cannot share memory with each other.
This means that all of the non-atomic operations that you might want to do on S3 (including edits to the middle of files, renames, etc.) are run on the machine running s3fs. As a result, if your machine crashes, it's not clear what's going to show up in your S3 bucket or if it would corrupt things.
s3fs is also slow as a result, because the next stop after your machine is S3, which isn't suitable for many file-based applications.
What AWS has built here is different: using EFS as the middle layer means that there's a safe, durable place for your file system operations to go while they're being assembled into object operations. It also means that the performance should be much better than s3fs (it's talking to SSDs where data is 1ms away instead of HDDs where data is 30ms away).
The purpose of S3 isn't to be cheap, it's to be simple.
I thought that would be their https://github.com/awslabs/mountpoint-s3 . But no mention about this one either.
S3 Files does have the advantage of having a "shared" cache via EFS, but then that would probably also make the cache slower.
Reading through it, I was only thinking "is this distinguished engineer TOC 2M aware that people have been doing this since forever?".
Previously I have done a periodic script that would simply re-sync the directory which works well enough. But curious if there's anything else out there.
https://aws.amazon.com/blogs/aws/amazon-s3-update-strong-rea...
we run datalakes using DuckLake and this sounds really useful. GCP should follow suit quickly.
Parquet is static append only, so DuckDB has no problems with those living on S3.
How do you see it helping with DuckLake?
Pre-compaction the recent data can be in small files, and the delete markers will also be in small files. This will bring down fetch times, while ducklake may have many of the larger blocks in memory or disk cache already.
Reading block headers for filtering is lots of small ranges, this could speed it up by 10x.
It would be really useful pre-compaction, and for dealing with the small-files issue without latency penalties.
Sell the benefits.
I have around 9 TB in 21m files on S3. How does this change benefit me?
edited slightly ... i really need to turn 10 minute post delay back on.