malcolmgreaves 4 days ago

Fun idea storing embeddings in inodes! Very clever!

I want to point out that this isn’t suitable for the workloads you’d actually use a vector database for. There’s no notion of a search index. Every query is an O(N) linear scan through all of your files: https://github.com/perone/vectorvfs/blob/main/vectorvfs/cli....

Still, fun idea :)

PaulHoule 4 days ago

The lack of an index is not bad at all if the vectors are stored contiguously in RAM: the mechanical sympathy is great, and SIMD will spin like a top, not to mention multithreading. Circa 2014 or so I worked on a search engine that scanned maybe 2 GB worth of vectors for 10 million documents; queries were turned around in much less than a second, and nobody complained about the speed.

If you gotta gather the data from a lot of different inodes, it is a different story.
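For a sense of what the contiguous case looks like, here is a minimal sketch of an exhaustive scan with NumPy. The corpus size, the random data, and the `top_k` helper are illustrative assumptions; none of it is taken from the search engine described above or from VectorVFS.

```python
import numpy as np

def top_k(matrix: np.ndarray, query: np.ndarray, k: int = 5) -> np.ndarray:
    """Return the row indices of the k best dot-product matches, best first."""
    scores = matrix @ query                      # one pass over contiguous memory
    k = min(k, scores.shape[0])
    best = np.argpartition(scores, -k)[-k:]      # unordered top k, O(N)
    return best[np.argsort(scores[best])[::-1]]  # sort only those k

# Hypothetical corpus: 100,000 unit-length float32 vectors of dimension 64.
rng = np.random.default_rng(0)
docs = rng.standard_normal((100_000, 64), dtype=np.float32)
docs /= np.linalg.norm(docs, axis=1, keepdims=True)

query = docs[123]          # query with a known document
hits = top_k(docs, query)  # row 123 should rank first
```

The whole search is a single matrix-vector product over one block of memory, which is the access pattern that lets the hardware stream data. Gathering the same vectors from many separate files first is where the extra cost comes in.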

ori_b 4 days ago

It's not stored contiguously in RAM. It's stored in extended attributes.
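For readers unfamiliar with extended attributes, here is a minimal, Linux-only sketch of keeping a vector on a file with Python's `os.setxattr` and `os.getxattr`. The attribute name `user.embedding` and the raw float32 layout are assumptions for illustration; VectorVFS may use a different name and encoding.

```python
import os
import struct

ATTR = "user.embedding"  # hypothetical attribute name, not necessarily VectorVFS's

def encode(vec):
    """Pack floats as little-endian float32 bytes."""
    return struct.pack(f"<{len(vec)}f", *vec)

def decode(raw):
    """Inverse of encode()."""
    return list(struct.unpack(f"<{len(raw) // 4}f", raw))

def write_embedding(path, vec):
    """Attach the vector to the file's inode as an extended attribute."""
    os.setxattr(path, ATTR, encode(vec))

def read_embedding(path):
    """Fetch the vector back; raises OSError if the attribute is missing."""
    return decode(os.getxattr(path, ATTR))
```

Each `read_embedding` call is a separate system call against a separate inode, so scanning a directory this way gathers vectors from many places instead of streaming one buffer.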

perone 4 days ago

Thanks. There is a bit of nuance there. For example, you can build an index in a first pass, which will indeed be linear, and then keep it in an open prompt for subsequent queries. I'm planning to implement that mode soon. But I agree, it is not intended to search 10 million files, and you seldom have that use case locally anyway.
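The mode described above might look roughly like the following: one linear pass gathers every vector from xattrs, and later queries are answered from memory. This is a sketch under assumptions, not the planned implementation; the attribute name, the float32 encoding, and the helper names are all hypothetical, and `os.getxattr` is Linux-only.

```python
import math
import os
import struct

ATTR = "user.embedding"  # hypothetical attribute name

def load_embeddings(root):
    """First pass: walk root once and collect {path: vector} from xattrs."""
    index = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                raw = os.getxattr(path, ATTR)
            except OSError:
                continue  # no embedding on this file, or xattrs unsupported
            index[path] = struct.unpack(f"<{len(raw) // 4}f", raw)
    return index

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def search(index, query, k=5):
    """Later queries: rank the in-memory vectors, no filesystem access."""
    ranked = sorted(index, key=lambda p: cosine(index[p], query), reverse=True)
    return ranked[:k]
```

An interactive prompt would call `load_embeddings` once at startup and then `search` for every query typed, so only the first query pays for the walk over the filesystem.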

binarymax 4 days ago

O(n) is still OK for vector search if n isn't too large. Filesystem search solutions are currently terrible, with background indexing jobs and poor relevance. This won't scale for every file on your system but anything in your working documents folder would easily work well.

int_19h 4 days ago

An index could be built on top of this though if desired. No need to have it in the FS itself.

yencabulator 2 days ago

But then there's no point in storing anything in xattrs.

int_19h 2 days ago

The reason would be that it's there as the source of truth, and when files e.g. get copied around, so does the metadata. The indexer doesn't need to be synchronous wrt such operations though, it can just watch the FS for changes and spin up reindexing as needed asynchronously.
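One way to picture this arrangement is a polling loop that compares snapshots of the tree and reindexes only what changed. The sketch below is an illustration under assumptions (a real indexer would more likely use inotify or a similar notification API, and the function names are hypothetical). It keys on `st_ctime_ns` because, on Linux, writing an xattr updates a file's change time but not its modification time.

```python
import os

def snapshot(root):
    """Map every file under root to its inode change time in nanoseconds."""
    seen = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                seen[path] = os.stat(path).st_ctime_ns
            except FileNotFoundError:
                pass  # file vanished between listing and stat
    return seen

def diff(old, new):
    """Return (paths to reindex, paths to drop) between two snapshots."""
    changed = sorted(p for p, t in new.items() if old.get(p) != t)
    removed = sorted(p for p in old if p not in new)
    return changed, removed
```

A background thread could take a snapshot every few seconds, call `diff` against the previous one, and update the external index for just those paths, while the xattrs on the files remain the source of truth.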

esafak 4 days ago

Thanks for saving readers time. If so, this is not a viable tool for production.