That obviously has a lot of interesting use cases, but my first assumption was that this could be used to quickly/easily search your filesystem with some prompt like "Play the video from last month where we went camping and saw a flock of turkeys". But that would require having an actual vector DB running on your system which you could use to quickly look up files using an embedding of your query, no?
I guess what I'm asking is: how does VectorVFS enable search besides iterating through all files and iteratively comparing file embeddings with the embedding of a search query? The project description says "efficient and semantically searchable" and "eliminating the need for external index files or services" but I can't think of any more efficient way to do a search without literally walking the entire filesystem tree to look for the file with the most similar vector.
Edit: reading the docs [1] confirmed this. The `vfs search TERM DIRECTORY` command:
> will automatically iterate over all files in the folder, look for supported files and then embed the file or load existing embeddings directly from the filesystem.
[1]: https://vectorvfs.readthedocs.io/en/latest/usage.html#vfs-se...
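For the curious, the brute-force search being described is easy to picture. Here is a rough sketch (not VectorVFS's actual code; the attribute name `user.vectorvfs.embedding` and the float32 storage format are assumptions) of a linear xattr-based scan:

```python
# Sketch of the O(N) search the docs describe: walk a directory,
# read a (hypothetical) embedding stored in an xattr, and rank
# files by cosine similarity to the query embedding.
import math
import os
import struct

def read_embedding(path, attr="user.vectorvfs.embedding"):
    """Load a float32 vector from an xattr; None if absent/unsupported."""
    try:
        raw = os.getxattr(path, attr)  # Linux-only API
    except OSError:
        return None
    return struct.unpack(f"{len(raw) // 4}f", raw)

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def search(query_vec, directory):
    """The linear walk: every query visits every file."""
    results = []
    for root, _, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            emb = read_embedding(path)
            if emb is not None:
                results.append((cosine(query_vec, emb), path))
    return sorted(results, reverse=True)
```

Every query pays the full walk, which is exactly the point made above; a real vector DB would put these embeddings behind an ANN index instead.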
xattrs are better off forgotten already; they were just as dumb an idea as macOS resource forks.
I want to point out that this isn't suitable for the kinds of workloads you'd actually use a vector database for. There's no notion of a search index; it's always an O(N) linear search through all of your files: https://github.com/perone/vectorvfs/blob/main/vectorvfs/cli....
Still, fun idea :)
If you gotta gather the data from a lot of different inodes, it is a different story.
1. https://weaviate.io/developers/weaviate/installation/embedde... 2. https://weaviate.io/developers/academy/py/vector_index/flat
For example, if we simply had the ability to have "ordered" files inside folders, that would instantly make it practical for a folder structure to represent "Documents". After all, documents are nothing but a list of paragraphs and images, so if we simply had ordering in file systems, we could have document editors that use individual files for each paragraph of text or image. It would be amazing.
Also think about use cases like Jupyter Notebooks. We could stop using the JSON file format and just make a notebook a folder structure instead, each cell (node) being its own file. All social media messages and chatbot conversations could easily be saved as folder structures.
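A toy sketch of the idea, with the caveat that no mainstream filesystem actually stores an order, so the ordering here is faked with numeric filename prefixes:

```python
# "Notebook as a folder": each cell lives in its own file, and
# ordering comes from a zero-padded numeric prefix in the filename.
import os

def write_cells(folder, cells):
    os.makedirs(folder, exist_ok=True)
    for i, text in enumerate(cells):
        # 0000.cell, 0001.cell, ... so lexical order == cell order
        with open(os.path.join(folder, f"{i:04d}.cell"), "w") as f:
            f.write(text)

def read_document(folder):
    parts = []
    for name in sorted(os.listdir(folder)):  # sort restores the order
        with open(os.path.join(folder, name)) as f:
            parts.append(f.read())
    return "\n\n".join(parts)
```

Filesystem-native ordering would remove the need for the prefix trick, which is what the comment is wishing for.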
I've heard many file copy tools ignore XATTR, so I've never tried to use it for this purpose. Maybe we've had the capability all along and nobody has yet used it in a big way that became popular. Maybe I should consider XATTR and take it seriously.
Re: ordered files: depends on FS. e.g. filesystems which use B+ trees will tend to have files (in directories) in lexical order. So in some cases you may not need a new FS:
echo 'for f in *.txt; do cat "$f"; done' > doc.sh; chmod +x doc.sh
=> `doc.sh` in a dir produces 'documents' (add newlines / breaks as needed, or pipe through a Markdown processor); symlink to some standardized filename 'Process', etc.

That said... wouldn't it be nice to have ridiculously easy pluggable features like
echo "finish this poem: roses are red," > /auto-llm/poem.txt; cat ..
:)

[1]: chaotic notes: https://kfs.mkj.lt/#welcome (see bullet point list below)
Does VectorVFS do retrieval, or store embeddings in EXT4?
Is retrieval logic obscured by VectorVFS?
If VectorVFS did retrieval with non-opaque embeddings, how would one debug why a file surfaced?
Storing embeddings with the file is an interesting concept... we already do it for some file formats (i.e. EXIF), whereas this one is generalized... yet you would still need some actual database to load this data into to process it at scale.
Another issue I see is support for different models and embedding formats, which you'd need to make this data really portable: I should be able to take my file, drop it into any system, and have its embedding "seamlessly" integrate.
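One way to move toward that portability, sketched here with invented `user.emb.*` attribute names: store a model identifier alongside the raw vector, so the receiving system at least knows what produced the embedding. (The xattr calls require a filesystem with user xattr support, e.g. ext4 on Linux.)

```python
# Self-describing embedding storage in xattrs: the model tag and
# the vector travel together with the file. Attribute names here
# are made up for illustration, not any real standard.
import os
import struct

def pack_embedding(vec):
    """Serialize a vector as little float32s (a portability assumption)."""
    return struct.pack(f"{len(vec)}f", *vec)

def unpack_embedding(raw):
    return list(struct.unpack(f"{len(raw) // 4}f", raw))

def store_embedding(path, model, vec):
    os.setxattr(path, "user.emb.model", model.encode())
    os.setxattr(path, "user.emb.vector", pack_embedding(vec))

def load_embedding(path):
    model = os.getxattr(path, "user.emb.model").decode()
    return model, unpack_embedding(os.getxattr(path, "user.emb.vector"))
```

A system reading the file can then refuse (or re-embed) when the model tag doesn't match what it expects, instead of silently comparing incompatible vectors.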
- hard links (only tar works for backup)
- small file size (or inodes run out before disk space)
http://root.rupy.se

It's very useful for globally distributed real-time data that doesn't need the P in CAP for writes.
(no new data can be created if one node is offline = you can login, but not register)
Ain't no such thing as zero-overhead indexing. Just because you can't articulate where the overhead is doesn't make it disappear.
Let me ask another question: is this intended for production use, or is it more of a research project? Because as a user I care about things like speed, simplicity, flexibility, and robustness.
I'd be surprised if cloud storage services like OneDrive don't already compute some kind of vector embedding for every file you store. But an online web service isn't the same as being built into the core of the OS.
I invented it because I found searching conventional file systems that support extended attributes to be unbearably slow.
Microsoft saw the tech support nightmare this could generate, and abandoned the project.
When BigQuery was still in alpha I had to ingest ~15 billion HTTP requests a day (headers, bodies, and metadata). None of the official tooling was ready, so I wrote a tiny bash script that:
1. uploaded the raw logs to Cloud Storage, and
2. tracked state with three folders: `pending/`, `processing/`, `done/`.
A cron job cycled through those directories and quietly pushed petabytes every week without dropping a byte. Later, Google's own pipelines, and third-party stacks like Logstash, never matched that script's throughput or reliability.

Lesson: reach for the filesystem first; add services only once you've proven you actually need them.
[1] https://en.wikipedia.org/wiki/Everything_is_a_file [2] https://en.wikipedia.org/wiki/Unix_philosophy
I would add that filesystems are superior to data formats (XML, JSON, YAML, TOML) for many use cases such as configuration or just storing data.
- Hierarchy is directories,
- Keys are file names,
- Values are file contents,
- Other metadata lives in hidden files.
It will work forever, and you can leverage ZFS, Git, rsync, or Syncthing much better. If you want, a fancy shell like Nushell will bring the experience pretty close to a database.

Most importantly, you don't need fancy editor plugins or to learn XPath, jq, or yq.
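A minimal sketch of reading such a layout back into a nested structure, assuming values are plain text files and hidden files carry the other metadata:

```python
# Directory-as-config: directories are sections, file names are
# keys, file contents are values, hidden files are skipped.
import os

def load_config(root):
    config = {}
    for name in os.listdir(root):
        if name.startswith("."):
            continue  # hidden files hold other metadata, per the layout
        path = os.path.join(root, name)
        if os.path.isdir(path):
            config[name] = load_config(path)  # recurse into sub-sections
        else:
            with open(path) as f:
                config[name] = f.read().strip()
    return config
```

Writing is just `mkdir -p` plus redirecting into a file, which is part of the appeal: every standard tool already speaks this format.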
1. For config, it spreads the configuration across a bunch of nested directories, making it hard to read and write without some sort of special tool that shows it all to you at once. Sure, you can open 50 files from all sorts of directories in your text editor, but that's pretty painful.
2. For data storage, lots of small files will waste partial storage blocks on many file systems. Some do coalesce small files, but many don't.
3. For both, it's often faster to read a single file from start to finish than a bunch of files. Most file systems try to keep a file's blocks mostly sequential (defragmented), whereas they typically don't do that across multiple files in different directories. SSDs make this mostly a non-issue these days, but you still pay for the extra opens, closes, and read calls.
I'm being slightly hypocritical because I've made plenty of use of the filesystem as a configuration store. In code it's quite easy to stat one path relative to a directory, or open it and read it, so it's very tempting.
We were building Reblaze (started 2011), a cloud WAF / DDoS-mitigation platform. Every HTTP request—good, bad, or ugly—had to be stored for offline anomaly-detection and clustering.
Traffic profile
- Baseline: ≈ 15 B requests/day
- Under attack: the same 15 B can arrive in 2-3 hours
Why BigQuery (even in alpha)? It was the only thing that could swallow that firehose and stay queryable minutes later, which is crucial when you're under attack and your data source must not melt down.
Pipeline (all shell + cron)
Edge nodes write JSON logs locally; a local cron pushes them to Cloud Storage.
Tiny VM with a cron loop
- Scans `pending/`, composes many small blobs into one “max-size” blob in `processing/`.
- Executes `bq load …` into the customer’s isolated dataset.
- On success, moves the blob to `done/`; on failure, drops it back to `pending/`.
Downstream ML/alerting pulls straight from BigQuery.

That handful of `gsutil`, `bq`, and `mv` commands moved multiple petabytes a week without losing a byte. Later pipelines (Dataflow, Logstash, etc.) never matched its throughput or reliability.
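The three-folder state machine is small enough to sketch. In this version, local directories stand in for the Cloud Storage buckets and a pluggable `load` callable stands in for `bq load`, since the exact commands aren't reproduced in the comment:

```python
# pending/ -> processing/ -> done/ state machine, one pass per
# cron tick. A failed load drops the blob back into pending/.
import os
import shutil

def process_once(base, load):
    pending = os.path.join(base, "pending")
    processing = os.path.join(base, "processing")
    done = os.path.join(base, "done")
    for d in (pending, processing, done):
        os.makedirs(d, exist_ok=True)
    for name in sorted(os.listdir(pending)):
        src = os.path.join(pending, name)
        work = os.path.join(processing, name)
        shutil.move(src, work)          # claim the blob
        try:
            load(work)                  # stand-in for `bq load ...`
            shutil.move(work, os.path.join(done, name))
        except Exception:
            shutil.move(work, src)      # failure: back to pending/
```

The directory a blob sits in *is* the state, which is why plain `mv` commands were enough: no database, no queue, and the next cron tick naturally retries anything still in `pending/`.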
Maybe with micro-kernels we'll finally fix this.
Almost all of the operations done on actual filesystems are not database-like; they stay close to the underlying hardware for practical reasons. If you want a database view, add one in an upper layer.
Could you provide a reference to support this background assertion? I'm not totally familiar with filesystems under the hood, but at this point doesn't storage hardware maintain a physical representation fairly independent of the logical one, given things like wear leveling?
Would you store all your ~/ in something like SQLite database?
Actually yeah that sounds pretty good.
For Desktop/Finder/Explorer you'd just need a nice UI.
Searching Documents/projects/etc would be the same just maybe faster?
All the arbitrary stuff like ~/.npm/**/* would stop cluttering up my ls -la in ~ and could be stored in their own tables whose names I genuinely don't care about. (This was the dream of ~/Library, no?)
[edit] Ooooh, I get it now. This doesn't solve namespacing or traversal.
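A toy sketch of what the "~/ in SQLite" idea could look like, with an invented single-table schema; prefix queries stand in for directory traversal, which is exactly the namespacing/traversal gap conceded in the edit above:

```python
# One table keyed by path: app clutter becomes rows instead of
# dotfiles. Schema is made up for illustration.
import sqlite3

def make_store():
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE files (path TEXT PRIMARY KEY, content BLOB)")
    return db

def put(db, path, content):
    db.execute("INSERT OR REPLACE INTO files VALUES (?, ?)", (path, content))

def ls(db, prefix):
    # Poor man's directory listing: a prefix LIKE query.
    rows = db.execute(
        "SELECT path FROM files WHERE path LIKE ? ORDER BY path",
        (prefix + "%",),
    )
    return [r[0] for r in rows]
```

Note that SQLite's LIKE treats `_` and `%` as wildcards, so real code would need to escape them: another small reminder that paths-as-strings don't give you hierarchy for free.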
1. Distributed filesystems do often use databases for metadata (FoundationDB for 3FS being a recent example)
2. Using a B+ tree for metadata is not much different from having a sorted index
3. Filesystems are a common enough use case that skipping the abstraction complexity to co-optimize the stack is warranted
Persistent file systems are essentially key-value stores, usually with optimizations for enumerating keys under a namespace (also known as listing the files in a directory). IMO a big problem with POSIX filesystems is the lack of atomicity and locking guarantees when editing a file. This, and the complete lack of a consistent networked API, are the key reasons few treat file systems as KV stores. It's a pity, really.
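The standard userspace workaround for the missing atomicity is write-new-then-rename, since `rename()` is atomic on POSIX filesystems. A minimal sketch:

```python
# Atomic file replacement: write to a temp file in the same
# directory (so the rename stays on one filesystem), fsync, then
# rename over the target in a single atomic step.
import os
import tempfile

def atomic_write(path, data):
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())   # make sure the bytes hit the disk
        os.replace(tmp, path)      # atomic swap, even if path exists
    except BaseException:
        os.unlink(tmp)             # don't leave temp litter on failure
        raise
```

Readers see either the old content or the new content, never a half-written file, which is as close as POSIX gets to a transactional update on a single key.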
The safest way to put the FS on a level playing field with other interfaces is to make the kernel not know about it, just as it doesn't know about, say, SQL.
I'll try an example. The kernel doesn't currently know about SQL. Instead, you e.g. connect to a socket and start talking to Postgres. Imagine if FS access worked the same way: you connect to a socket, then issue various commands to read and write files. Ignoring perf for a moment, it works, right?
Now, one counter-argument might be: "hold up, what is this socket you need to connect to, isn't that part of a file system? Is there now an all-userspace inner filesystem inside a still-kernel-supported 'meta filesystem'?" Well, the answer to that is that maybe the Unix idea of putting communication channels like pipes and (to a lesser extent) sockets in the filesystem was a bad idea. Or rather, there may be nothing wrong with saying a directory can have a child that is such a communication channel, but there is a problem with saying that every such communication channel must live inside some directory.
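To make the thought experiment concrete, here's a toy version of "files over a socket", using a `socketpair()` precisely to dodge the question of where the socket itself would live; the one-command protocol is obviously invented:

```python
# File access as a wire protocol, postgres-style: a client sends
# "READ <path>" over a socket and gets the contents back. A real
# design would need framing, errors, auth, writes, etc.
import socket
import threading

FILES = {"/etc/motd": b"hello"}   # in-memory stand-in for storage

def serve(conn):
    cmd, _, path = conn.recv(1024).decode().partition(" ")
    if cmd == "READ":
        conn.sendall(FILES.get(path, b""))
    conn.close()

def read_via_socket(path):
    # socketpair() gives two pre-connected endpoints, sidestepping
    # "where does the socket live" from the counter-argument above.
    server, client = socket.socketpair()
    t = threading.Thread(target=serve, args=(server,))
    t.start()
    client.sendall(f"READ {path}".encode())
    data = client.recv(1024)
    client.close()
    t.join()
    return data
```

Ignoring performance, nothing here needs the kernel to know it is serving "files", which is the point of the comparison with SQL.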
It's important to remember that the cloud is also invented by the old school and understanding the oscillation between client/server architectures vs local, and it's implication on topics of data and files is interesting too.
More questions meant more learning, until I learned there's no one right or wrong, just what works best, where, when, for how long, and what the tradeoffs are.
Quick wins/decisions are often bandaids that pile up in a different way.