That obviously has a lot of interesting use cases, but my first assumption was that this could be used to quickly/easily search your filesystem with some prompt like "Play the video from last month where we went camping and saw a flock of turkeys". But that would require having an actual vector DB running on your system which you could use to quickly look up files using an embedding of your query, no?
I guess what I'm asking is: how does VectorVFS enable search besides iterating through all files and iteratively comparing file embeddings with the embedding of a search query? The project description says "efficient and semantically searchable" and "eliminating the need for external index files or services" but I can't think of any more efficient way to do a search without literally walking the entire filesystem tree to look for the file with the most similar vector.
Edit: reading the docs [1] confirmed this. The `vfs search TERM DIRECTORY` command:
> will automatically iterate over all files in the folder, look for supported files and then embed the file or load existing embeddings directly from the filesystem.
[1]: https://vectorvfs.readthedocs.io/en/latest/usage.html#vfs-se...
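For the curious, the brute-force search being described is easy to picture. Here is a rough sketch (not VectorVFS's actual code; the attribute name `user.vectorvfs.embedding` and the float32 storage format are assumptions) of a linear xattr-based scan:

```python
# Sketch of the O(N) search the docs describe: walk a directory,
# read a (hypothetical) embedding stored in an xattr, and rank
# files by cosine similarity to the query embedding.
import math
import os
import struct

def read_embedding(path, attr="user.vectorvfs.embedding"):
    """Load a float32 vector from an xattr; None if absent/unsupported."""
    try:
        raw = os.getxattr(path, attr)  # Linux-only API
    except OSError:
        return None
    return struct.unpack(f"{len(raw) // 4}f", raw)

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def search(query_vec, directory):
    """The linear walk: every query visits every file."""
    results = []
    for root, _, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            emb = read_embedding(path)
            if emb is not None:
                results.append((cosine(query_vec, emb), path))
    return sorted(results, reverse=True)
```

Every query pays the full walk, which is exactly the point made above; a real vector DB would put these embeddings behind an ANN index instead.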
xattrs are better off forgotten already; they were just as dumb an idea as macOS resource forks.
I want to point out that this isn't suitable for the kinds of workloads you'd actually use a vector database for. There's no notion of a search index; it's always an O(N) linear search through all of your files: https://github.com/perone/vectorvfs/blob/main/vectorvfs/cli....
Still, fun idea :)
If you gotta gather the data from a lot of different inodes, it is a different story.
1. https://weaviate.io/developers/weaviate/installation/embedde... 2. https://weaviate.io/developers/academy/py/vector_index/flat
For example, if we simply had the ability to have "ordered" files inside folders, that would instantly make it practical for a folder structure to represent "Documents". After all, documents are nothing but a list of paragraphs and images, so if we simply had ordering in file systems, we could have document editors that use individual files for each paragraph of text or image. It would be amazing.
Also think about use cases like Jupyter Notebooks. We could stop using the JSON file format and just make a notebook a folder structure instead, each cell (node) being its own file. All social media messages and chatbot conversations could easily be saved as folder structures.
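A toy sketch of the idea, with the caveat that no mainstream filesystem actually stores an order, so the ordering here is faked with numeric filename prefixes:

```python
# "Notebook as a folder": each cell lives in its own file, and
# ordering comes from a zero-padded numeric prefix in the filename.
import os

def write_cells(folder, cells):
    os.makedirs(folder, exist_ok=True)
    for i, text in enumerate(cells):
        # 0000.cell, 0001.cell, ... so lexical order == cell order
        with open(os.path.join(folder, f"{i:04d}.cell"), "w") as f:
            f.write(text)

def read_document(folder):
    parts = []
    for name in sorted(os.listdir(folder)):  # sort restores the order
        with open(os.path.join(folder, name)) as f:
            parts.append(f.read())
    return "\n\n".join(parts)
```

Filesystem-native ordering would remove the need for the prefix trick, which is what the comment is wishing for.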
I've heard many file copy tools ignore XATTR, so I've never tried to use it for this purpose. Maybe we've had the capability all along and nobody has yet used it in a big way that became popular. Maybe I should consider XATTR and take it seriously.
Re: ordered files: depends on FS. e.g. filesystems which use B+ trees will tend to have files (in directories) in lexical order. So in some cases you may not need a new FS:
echo 'for f in *.txt; do cat "$f"; done' > doc.sh; chmod +x doc.sh
=> `doc.sh` in a dir produces 'documents' (add newlines / breaks as needed, or pipe through a Markdown processor); symlink to some standardized filename 'Process', etc.

That said... wouldn't it be nice to have ridiculously easy pluggable features like
echo "finish this poem: roses are red," > /auto-llm/poem.txt; cat ..
:)

[1]: chaotic notes: https://kfs.mkj.lt/#welcome (see bullet point list below)
Does VectorVFS do retrieval, or store embeddings in EXT4?
Is retrieval logic obscured by VectorVFS?
If VectorVFS did retrieval with non-opaque embeddings, how would one debug why a file surfaced?
Storing embeddings with the file is an interesting concept... we already do it for some file formats (i.e. EXIF), whereas this one is generalized... yet you would still need some actual database to load this data into to process it at scale.
Another issue I see is support for different models and embedding formats, which you'd need to make this data really portable: I should be able to take my file, drop it into any system, and have its embedding "seamlessly" integrate.
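One way to move toward that portability, sketched here with invented `user.emb.*` attribute names: store a model identifier alongside the raw vector, so the receiving system at least knows what produced the embedding. (The xattr calls require a filesystem with user xattr support, e.g. ext4 on Linux.)

```python
# Self-describing embedding storage in xattrs: the model tag and
# the vector travel together with the file. Attribute names here
# are made up for illustration, not any real standard.
import os
import struct

def pack_embedding(vec):
    """Serialize a vector as little float32s (a portability assumption)."""
    return struct.pack(f"{len(vec)}f", *vec)

def unpack_embedding(raw):
    return list(struct.unpack(f"{len(raw) // 4}f", raw))

def store_embedding(path, model, vec):
    os.setxattr(path, "user.emb.model", model.encode())
    os.setxattr(path, "user.emb.vector", pack_embedding(vec))

def load_embedding(path):
    model = os.getxattr(path, "user.emb.model").decode()
    return model, unpack_embedding(os.getxattr(path, "user.emb.vector"))
```

A system reading the file can then refuse (or re-embed) when the model tag doesn't match what it expects, instead of silently comparing incompatible vectors.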
- hard links (only tar works for backup)
- small file size (or inodes run out before disk space)
http://root.rupy.se

It's very useful for globally distributed real-time data that doesn't need the P in CAP for writes.
(no new data can be created if one node is offline = you can login, but not register)
Ain't no such thing as zero-overhead indexing. Just because you can't articulate where the overhead is doesn't make it disappear.
Let me ask another question: is this intended for production use, or is it more of a research project? Because as a user I care about things like speed, simplicity, flexibility, and robustness.
I'd be surprised if cloud storage services like OneDrive don't already compute some kind of vector embedding for every file you store. But an online web service isn't the same as being built into the core of the OS.
I invented it because I found searching conventional file systems that support extended attributes to be unbearably slow.
Microsoft saw the tech support nightmare this could generate, and abandoned the project.
When BigQuery was still in alpha I had to ingest ~15 billion HTTP requests a day (headers, bodies, and metadata). None of the official tooling was ready, so I wrote a tiny bash script that:
1. uploaded the raw logs to Cloud Storage, and
2. tracked state with three folders: `pending/`, `processing/`, `done/`.
A cron job cycled through those directories and quietly pushed petabytes every week without dropping a byte. Later, Google's own pipelines, and third-party stacks like Logstash, never matched that script's throughput or reliability.

Lesson: reach for the filesystem first; add services only once you've proven you actually need them.
[1] https://en.wikipedia.org/wiki/Everything_is_a_file [2] https://en.wikipedia.org/wiki/Unix_philosophy
I would add that filesystems are superior to data formats (XML, JSON, YAML, TOML) for many use cases such as configuration or just storing data.
- Hierarchy is directories,
- Keys are file names,
- Values are file contents,
- Other metadata lives in hidden files.
It will work forever, and you can leverage ZFS, Git, rsync, or Syncthing much better. If you want, a fancy shell like Nushell will bring the experience pretty close to a database.

Most importantly, you don't need fancy editor plugins or to learn XPath, jq, or yq.
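A minimal sketch of reading such a layout back into a nested structure, assuming values are plain text files and hidden files carry the other metadata:

```python
# Directory-as-config: directories are sections, file names are
# keys, file contents are values, hidden files are skipped.
import os

def load_config(root):
    config = {}
    for name in os.listdir(root):
        if name.startswith("."):
            continue  # hidden files hold other metadata, per the layout
        path = os.path.join(root, name)
        if os.path.isdir(path):
            config[name] = load_config(path)  # recurse into sub-sections
        else:
            with open(path) as f:
                config[name] = f.read().strip()
    return config
```

Writing is just `mkdir -p` plus redirecting into a file, which is part of the appeal: every standard tool already speaks this format.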
1. For config, it spreads the configuration across a bunch of nested directories, making it hard to read and write without some sort of special tool that shows it all to you at once. Sure, you can open 50 files from all sorts of directories in your text editor, but that's pretty painful.
2. For data storage, lots of small files will waste partial storage blocks on many file systems. Some do coalesce small files, but many don't.
3. For both, it's often faster to read a single file from start to finish than a bunch of files. Most file systems try to keep a file's blocks mostly sequential (defragmented), whereas they typically don't do that across multiple files in different directories. SSDs make this mostly a non-issue these days, but you still pay for the extra opens, closes, and read calls.
I'm being slightly hypocritical because I've made plenty of use of the filesystem as a configuration store. In code it's quite easy to stat one path relative to a directory, or open it and read it, so it's very tempting.
We were building Reblaze (started 2011), a cloud WAF / DDoS-mitigation platform. Every HTTP request—good, bad, or ugly—had to be stored for offline anomaly-detection and clustering.
Traffic profile
- Baseline: ≈ 15 B requests/day
- Under attack: the same 15 B can arrive in 2-3 hours
Why BigQuery (even in alpha)? It was the only thing that could swallow that firehose and stay queryable minutes later, which is crucial when you're under attack and your data source must not melt down.
Pipeline (all shell + cron)
Edge nodes write JSON logs locally; a local cron pushes them to Cloud Storage.
Tiny VM with a cron loop
- Scans `pending/`, composes many small blobs into one “max-size” blob in `processing/`.
- Executes `bq load …` into the customer’s isolated dataset.
- On success, moves the blob to `done/`; on failure, drops it back to `pending/`.
Downstream ML/alerting pulls straight from BigQuery.

That handful of `gsutil`, `bq`, and `mv` commands moved multiple petabytes a week without losing a byte. Later pipelines (Dataflow, Logstash, etc.) never matched its throughput or reliability.
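The three-folder state machine is small enough to sketch. In this version, local directories stand in for the Cloud Storage buckets and a pluggable `load` callable stands in for `bq load`, since the exact commands aren't reproduced in the comment:

```python
# pending/ -> processing/ -> done/ state machine, one pass per
# cron tick. A failed load drops the blob back into pending/.
import os
import shutil

def process_once(base, load):
    pending = os.path.join(base, "pending")
    processing = os.path.join(base, "processing")
    done = os.path.join(base, "done")
    for d in (pending, processing, done):
        os.makedirs(d, exist_ok=True)
    for name in sorted(os.listdir(pending)):
        src = os.path.join(pending, name)
        work = os.path.join(processing, name)
        shutil.move(src, work)          # claim the blob
        try:
            load(work)                  # stand-in for `bq load ...`
            shutil.move(work, os.path.join(done, name))
        except Exception:
            shutil.move(work, src)      # failure: back to pending/
```

The directory a blob sits in *is* the state, which is why plain `mv` commands were enough: no database, no queue, and the next cron tick naturally retries anything still in `pending/`.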
Maybe with micro-kernels we'll finally fix this.
Almost all of the operations done on actual filesystems are not database-like; they stay close to the underlying hardware for practical reasons. If you want a database view, add one in an upper layer.
Could you provide a reference to support this background assertion? I'm not totally familiar with filesystems under the hood, but at this point doesn't storage hardware maintain a physical representation fairly independent of the logical one, given things like wear leveling?
Would you store all your ~/ in something like SQLite database?
Actually yeah that sounds pretty good.
For Desktop/Finder/Explorer you'd just need a nice UI.
Searching Documents/projects/etc would be the same just maybe faster?
All the arbitrary stuff like ~/.npm/**/* would stop cluttering up my ls -la in ~ and could be stored in their own tables whose names I genuinely don't care about. (This was the dream of ~/Library, no?)
[edit] Ooooh, I get it now. This doesn't solve namespacing or traversal.
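A toy sketch of what the "~/ in SQLite" idea could look like, with an invented single-table schema; prefix queries stand in for directory traversal, which is exactly the namespacing/traversal gap conceded in the edit above:

```python
# One table keyed by path: app clutter becomes rows instead of
# dotfiles. Schema is made up for illustration.
import sqlite3

def make_store():
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE files (path TEXT PRIMARY KEY, content BLOB)")
    return db

def put(db, path, content):
    db.execute("INSERT OR REPLACE INTO files VALUES (?, ?)", (path, content))

def ls(db, prefix):
    # Poor man's directory listing: a prefix LIKE query.
    rows = db.execute(
        "SELECT path FROM files WHERE path LIKE ? ORDER BY path",
        (prefix + "%",),
    )
    return [r[0] for r in rows]
```

Note that SQLite's LIKE treats `_` and `%` as wildcards, so real code would need to escape them: another small reminder that paths-as-strings don't give you hierarchy for free.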
1. Distributed filesystems do often use databases for metadata (FoundationDB for 3FS being a recent example)
2. Using a B+ tree for metadata is not much different from having a sorted index
3. Filesystems are a common enough use case that skipping the abstraction complexity to co-optimize the stack is warranted
Persistent file systems are essentially key-value stores, usually with optimizations for enumerating keys under a namespace (also known as listing the files in a directory). IMO a big problem with POSIX filesystems is the lack of atomicity and locking guarantees when editing a file. This, and the complete lack of a consistent networked API, are the key reasons few treat file systems as KV stores. It's a pity, really.
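The standard userspace workaround for the missing atomicity is write-new-then-rename, since `rename()` is atomic on POSIX filesystems. A minimal sketch:

```python
# Atomic file replacement: write to a temp file in the same
# directory (so the rename stays on one filesystem), fsync, then
# rename over the target in a single atomic step.
import os
import tempfile

def atomic_write(path, data):
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())   # make sure the bytes hit the disk
        os.replace(tmp, path)      # atomic swap, even if path exists
    except BaseException:
        os.unlink(tmp)             # don't leave temp litter on failure
        raise
```

Readers see either the old content or the new content, never a half-written file, which is as close as POSIX gets to a transactional update on a single key.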
The safest way to put the FS on a level playing field with other interfaces is to make the kernel not know about it, just as it doesn't know about, say, SQL.
I'll try an example. The kernel doesn't currently know about SQL. Instead, you e.g. connect to a socket and start talking to Postgres. Imagine if FS access worked the same way: you connect to a socket, then issue various commands to read and write files. Ignoring perf for a moment, it works, right?
Now, one counter-argument might be: "hold up, what is this socket you need to connect to, isn't that part of a file system? Is there now an all-userspace inner filesystem inside a still-kernel-supported 'meta filesystem'?" Well, the answer to that is that maybe the Unix idea of putting communication channels like pipes and (to a lesser extent) sockets in the filesystem was a bad idea. Or rather, there may be nothing wrong with saying a directory can have a child that is such a communication channel, but there is a problem with saying that every such communication channel must live inside some directory.
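To make the thought experiment concrete, here's a toy version of "files over a socket", using a `socketpair()` precisely to dodge the question of where the socket itself would live; the one-command protocol is obviously invented:

```python
# File access as a wire protocol, postgres-style: a client sends
# "READ <path>" over a socket and gets the contents back. A real
# design would need framing, errors, auth, writes, etc.
import socket
import threading

FILES = {"/etc/motd": b"hello"}   # in-memory stand-in for storage

def serve(conn):
    cmd, _, path = conn.recv(1024).decode().partition(" ")
    if cmd == "READ":
        conn.sendall(FILES.get(path, b""))
    conn.close()

def read_via_socket(path):
    # socketpair() gives two pre-connected endpoints, sidestepping
    # "where does the socket live" from the counter-argument above.
    server, client = socket.socketpair()
    t = threading.Thread(target=serve, args=(server,))
    t.start()
    client.sendall(f"READ {path}".encode())
    data = client.recv(1024)
    client.close()
    t.join()
    return data
```

Ignoring performance, nothing here needs the kernel to know it is serving "files", which is the point of the comparison with SQL.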
It's important to remember that the cloud is also invented by the old school and understanding the oscillation between client/server architectures vs local, and it's implication on topics of data and files is interesting too.
More questions meant more learning, until I learned there's no one right or wrong, just what works best, where, when, for how long, and what the tradeoffs are.
Quick wins/decisions are often bandaids that pile up in a different way.