So typically: swap off on servers. Do they have a server story?
Edit: I think using ZFS for your /tmp would solve this. You get ECC memory writing to a checksummed file system.
Although if you do swap on a server (and you should), the swap needs to be on RAID, otherwise your server will crash on a disk error.
Swap on a server is not meant for handling low-memory situations; rather, there's tons of data on a server that's almost never used, so swap that out and make more room for cache.
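If you want to nudge the kernel toward that behavior (prefer swapping out idle anonymous pages over dropping file cache), the usual knob on Linux is vm.swappiness; a sketch, with a hypothetical drop-in file name and an illustrative value:

```
# /etc/sysctl.d/99-swap.conf  (hypothetical file name)
# Higher values make the kernel more willing to swap out idle
# anonymous pages in order to keep file-backed cache in RAM.
vm.swappiness = 100
```

Apply with `sysctl --system`; the right value depends on the workload.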
Second, the binaries of your processes are mapped in as named pages (because they come from the ELF file).
Named pages are generally not counted as "used" memory because they can be evicted and reclaimed, but if you have a service with a 150MB binary running, those 150MB of seemingly "free" memory are absolutely crucial for performance.
Running out of this 150MB of disk cache will result in the machine using up all its I/O capacity re-fetching the ELF from disk, likely becoming unresponsive. Having swap significantly delays this lock-up by allowing anonymous pages to be evicted instead, so the same memory pressure causes fewer stalls.
So until the OOM management on Linux gets fixed, you need swap.
I meant this sort of jokingly. I think I have a few Linux systems that were never configured with swap partitions or swapfiles.
I still think it's a terrible idea.
If you have swap already it doesn't matter, but I've encountered enough thrashing that I now disable swap on almost all servers I work with.
It's rare, but when it happens the server usually becomes completely unresponsive, so you have to hard-reset it. I'd rather the application trying to use too much memory be killed by the OOM killer so I can ssh in and fix it.
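For reference, turning swap off on such a box is a small change (a sketch, assuming a typical Linux setup with a swap entry in fstab):

```
# Turn off all active swap immediately:
swapoff -a

# Then comment out or remove the swap line in /etc/fstab so it
# stays off across reboots, e.g. a line like:
#   UUID=...  none  swap  sw  0  0
```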
[1] https://docs.redhat.com/en/documentation/red_hat_enterprise_...
But the auto-cleanup feature looks awful to me. Be it desktops or servers, even machines with uptimes of more than a year, I never saw /tmp filled just with forgotten garbage. Only sometimes filled by unzipping a too-big file or something like that, but you notice that on the spot.
It used to be the place where you could store caches and other things like that which would hold until the next reboot. Having files automatically deleted there after some arbitrary time looks like a source of random, unexpected bugs.
I don't know where this feature comes from, but when stupid risky things like this show up, I would easily bet that it is again a systemd "I know best what is good for you" broken feature shoved down our throats...
And if it comes from systemd, expect that one day it will accidentally delete important files of yours, something like following symlinks to your home dir or your NVMe EFI partition...
It might have more to do with the type of developers I've worked with, but it happens all the time. Monitoring complains, you go in to check, and there it is: gigabytes of junk dumped there by shitty software or scripts that can't clean up after themselves.
The issue is that you don't always know what's safe to delete if you're the operations person and not the developer. Periodically auto-cleaning /tmp is going to break stuff, and it will be easier to demand that the operations team disable auto-cleanup than to get the issue fixed in the developers' next sprint.
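The developer-side fix is to use the platform's temp-file APIs, which clean up after themselves, instead of dumping loose files into /tmp. A minimal Python sketch (the `myjob-` prefix is just illustrative):

```python
import os
import tempfile

# TemporaryDirectory removes itself (and everything inside it)
# when the context exits, so nothing is left for ops to guess about.
with tempfile.TemporaryDirectory(prefix="myjob-") as workdir:
    scratch = os.path.join(workdir, "scratch.dat")
    with open(scratch, "wb") as f:
        f.write(b"intermediate results")
    assert os.path.exists(scratch)

# Outside the block, the directory and its contents are gone.
assert not os.path.exists(workdir)
```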
Store /tmp in memory (tmpfs): volatile, but limited to free RAM plus swap, and swapping writes to disk anyway.
Store /tmp on a dedicated volume: since we're going to write to disk anyway, make it a lightweight, special-purpose file system that's committed to disk.
On-disk /tmp, cleaned up periodically: this needs additional settings for cleanup. How often? What should stay? Tie file lifetime to machine reboot? The answers to these questions vary more between applications than between filesystems, so it's more flexible to leave cleanup to userspace.
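For comparison, systemd's answer to those knobs is a tmpfiles.d drop-in; something like this (paths and ages are illustrative, and I believe distros ship a similar default) sets a per-directory cleanup age:

```
# /etc/tmpfiles.d/tmp.conf  (illustrative)
# Type  Path      Mode  UID   GID   Age
q       /tmp      1777  root  root  10d
q       /var/tmp  1777  root  root  30d
```

It answers "how often" and "what should stay" globally, which is exactly the one-size-fits-all problem described above.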
In the end my main concern turned out to be losing files that I didn't want to lose, whether to reboot cleanup or timer cleanup. I opted to clean up my temp files manually as needed.
It’s still not an ideal solution though.
Most disks have enough write cycles available that you'll be fine either way, so it's a tiny benefit.
Didn't need it on NetBSD; memory could go to zero and the system would (thrash but) not crash. When I switched to Linux, the OOM issue was a shock at first, but I learned to avoid it.
I use small-form-factor computers, with userland mounted in and running from memory, no swap; I only use long-term storage for non-temporary data.
https://www.kingston.com/unitedkingdom/en/blog/pc-performanc...
https://www.redhat.com/en/blog/polyinstantiating-tmp-and-var...
If that happens, reading the file back is DRAMATICALLY slower than if you had just stored the file on disk in the first place.
This change is not going to speed things up for most users; it will slow things down. Instead of caching important files, you waste memory on useless temporary files. Then the system swaps them out so it can get the cache back, and then they're really slow to read back.
This change is a mistake.
Also, you can easily disable it: https://www.debian.org/releases/trixie/release-notes/issues....
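If I remember the linked release notes correctly, opting out is essentially a one-liner (then reboot):

```
# Keep /tmp on the root filesystem instead of tmpfs:
systemctl mask tmp.mount
```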
When my Linux VM starts swapping I have to either wait an hour or more to regain control, or just hard restart the VM.
This change puts the least important data (temp files) in RAM while evicting much more important cache data.
EDIT: Thank you, jaunty. But all of these are device level. Even bcachefs was block device level. It doesn't allow union over a FUSE FS etc. It seems strange to not have it at the filesystem level.
EDIT: So, wikipedia lists overlayfs and aufs as active projects and unionfs predates both. Maybe unionfs v2 is what replaced all that? Maybe I'm hallucinating...
What I want is pretty much like how a write-through cache would work.
1. Write to top-level FS? The write cascades down but reads are fast immediately
2. Data not available in top-level FS? The read goes down to the bottom level and then reads up to the top so future reads are fast.
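Overlayfs gets you the union half of this (reads fall through to the lower layer; writes land in the upper layer via copy-up), though not the write-through caching semantics described above. A sketch with hypothetical paths:

```
# Fast upper layer stacked over a slow lower layer:
mount -t overlay overlay \
      -o lowerdir=/slow,upperdir=/fast/upper,workdir=/fast/work \
      /merged
```

The upper and work directories must be on the same (writable) filesystem; and as noted above, what can serve as a lower layer is restricted, which is the FUSE complaint.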
(Also, sorry but this article absolutely does not constitute a “deep dive” into anything.)
Who runs around with 100GB+ of swap?!
Our default server images come with a 4.4GB /tmp partition...
My /tmp is my default folder for downloads and temporary work. It will grow 100GB+ easily.
And high memory pressure is also what makes disk-backed /tmp slow. No improvement at all.
It's a really bad idea to put /tmp into memory. Filesystems already use memory when possible and spill to disk when memory is under pressure. If they don't do this correctly (which they do), then fix your filesystem! That will benefit everything.
This feels like a very unnecessary change and nothing in that article made a convincing argument for the contrary.
Does /dev/shm stay? Surely it does, but it is also capped at 50% of RAM. Does that mean /dev/shm + /tmp can now reach 100% of RAM? Or do they share the same RAM budget?
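As far as I know they are separate tmpfs mounts, each with its own per-mount cap, so the limits don't share a budget; you can check and override them, e.g. (size value is illustrative):

```
# Check the current caps of each mount:
df -h /tmp /dev/shm

# Override in /etc/fstab, e.g. cap /tmp at 25% of RAM:
tmpfs  /tmp  tmpfs  size=25%,mode=1777  0  0
```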