Item 43897147

jlhawn • 4 days ago

If I understand correctly, this is attaching metadata to files in a format that LLMs (or any tool that can understand the semantic embedding vector) can leverage to understand what a file is without having to actually read the contents of the file.

That obviously has a lot of interesting use cases, but my first assumption was that this could be used to quickly/easily search your filesystem with some prompt like "Play the video from last month where we went camping and saw a flock of turkeys". But that would require having an actual vector DB running on your system which you could use to quickly look up files using an embedding of your query, no?

perone • 4 days ago

Hi, it is quite different: there is no LLM involved. We could certainly use it for RAG, for example, but what is currently implemented is basically a way to generate embeddings (vector representations), which are then used for search later. It is all offline and local (no data from your files is ever sent to the cloud).

jlhawn • 4 days ago

I understand that LLMs aren't involved in generating the embeddings and adding the xattrs. I was just wondering what the value add of this is if there's no other background process (like mds on macOS) which is using it to build a search index.

I guess what I'm asking is: how does VectorVFS enable search besides iterating through all files and comparing each file's embedding with the embedding of a search query? The project description says "efficient and semantically searchable" and "eliminating the need for external index files or services", but I can't think of any way to do a search other than literally walking the entire filesystem tree to look for the file with the most similar vector.

Edit: reading the docs [1] confirmed this. The `vfs search TERM DIRECTORY` command:

> will automatically iterate over all files in the folder, look for supported files and then embed the file or load existing embeddings directly from the filesystem.

[1]: https://vectorvfs.readthedocs.io/en/latest/usage.html#vfs-se...
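
To make the cost concrete, here is a minimal sketch of what such a linear scan could look like. It is not VectorVFS's actual code: the attribute name `user.example.embedding`, the float32 little-endian encoding, and all function names are assumptions for illustration. Every search touches every file under the root, so the work grows linearly with the number of files.

```python
import math
import os
import struct

# Hypothetical attribute name; the key VectorVFS really uses is not stated in this thread.
XATTR_NAME = "user.example.embedding"


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def load_embedding(path):
    """Read a vector from the file's xattr; return None if missing or unsupported."""
    if not hasattr(os, "getxattr"):  # os.getxattr is only available on Linux
        return None
    try:
        raw = os.getxattr(path, XATTR_NAME)
    except OSError:
        return None
    if len(raw) % 4:
        return None
    # Assumed encoding: packed little-endian float32 values.
    return list(struct.unpack(f"<{len(raw) // 4}f", raw))


def rank(query, embeddings, top_k=5):
    """Score every candidate against the query and keep the best top_k."""
    scored = [(cosine(query, vec), path) for path, vec in embeddings.items()]
    scored.sort(reverse=True)
    return scored[:top_k]


def search(root, query, top_k=5):
    """Walk the whole tree, load each stored embedding, then rank: O(number of files)."""
    found = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            vec = load_embedding(path)
            if vec is not None and len(vec) == len(query):
                found[path] = vec
    return rank(query, found, top_k)
```

A separate index (for example an approximate nearest-neighbour structure) would avoid the full walk, which is exactly the external service the project description says it eliminates.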

freeamz • 3 days ago

Yeah, this kind of setup is indefinitely scalable, but not searchable without a meta db/index keeping track of all the nodes.

pilooch • 4 days ago

Using it for a RAG is smart indeed, especially with a multimodal encoder (vision-rag), as the implementation would be straightforward from what you already have.

lstodd • 4 days ago

if you go look up how xattrs work, you will understand it's no different than just reading a chunk of the file in question, and in fact can be slower.

xattrs are better off forgotten already. it was just as dumb an idea as macos resource forks.

SoftTalker • 3 days ago

Also locks you in to filesystems that support them, which are not all of them or on all operating systems.
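
A small sketch of what that lock-in looks like from code, assuming Python on Linux. The helper name `try_set_xattr` and the attribute name used with it are made up for illustration. Whether the write succeeds depends on the platform and on the filesystem the path lives on, so the caller has to be ready for a fallback.

```python
import os


def try_set_xattr(path, name, value: bytes) -> bool:
    """Try to store value in an extended attribute; return False if that is not possible."""
    if not hasattr(os, "setxattr"):  # os.setxattr is only available on Linux
        return False
    try:
        os.setxattr(path, name, value)
    except OSError:
        # Raised when the filesystem lacks xattr support, the value is too
        # large, the path does not exist, or permissions forbid the write.
        return False
    return True
```

A tool relying on this would need some other storage path (a sidecar file, a database) wherever the function returns `False`.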

lstodd • 4 days ago

so, like magic(5)?

mywittyname • 4 days ago

What is magic(5) and how is it similar to what was described?

danudey • 4 days ago

magic(5) is a system for determining the type of a file by examining the 'magic bytes' at or near the start of a file.

For example, POSIX tar files have a defined file format that starts with a header struct: https://www.gnu.org/software/tar/manual/html_node/Standard.h...

You can see that at byte offset 257 is `char magic[6]`, which contains `TMAGIC`, which is the byte string "ustar\0". Thus, if a file has the bytes 'ustar\0' at offset 257 we can reasonably assume that it's a tar file. Almost every defined file type has some kind of string of 'magic' predefined bytes at a predefined location that lets a program know "yes, this is in fact a JPEG file" rather than just asserting "it says .jpg so let's try to interpret this bytestring and see what happens".
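
As a rough illustration of that check (a sketch only, not how file(1) is implemented; the function names are made up here), the tar test comes down to comparing six bytes at a fixed offset:

```python
TAR_MAGIC_OFFSET = 257      # offset of `char magic[6]` in the POSIX header
TMAGIC = b"ustar\x00"       # "ustar" followed by a NUL byte


def has_ustar_magic(header: bytes) -> bool:
    """Return True if the bytes at offset 257 are the POSIX tar magic."""
    return header[TAR_MAGIC_OFFSET:TAR_MAGIC_OFFSET + len(TMAGIC)] == TMAGIC


def file_has_ustar_magic(path) -> bool:
    """Read the first 512-byte header block of a file and test it."""
    with open(path, "rb") as f:
        return has_ustar_magic(f.read(512))
```

Note that this only matches the POSIX magic; archives in the old GNU format use a slightly different byte string in the same position, so a real detector carries more than one rule per file type.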

As for how it's similar: I don't think it actually is, I think that's a misunderstanding. The metadata that this vector FS is storing is more than "this is a JPEG" or "this is a word document", as I understand it, so comparing it to magic(5) is extremely reductionist. I could be mistaken, however.

simcop2387 • 4 days ago

I think they're referring to this, https://linux.die.net/man/5/magic given the notation. That said I don't really see how it'd be all that relevant to the discussion so maybe i'm missing something else.

0x457 • 4 days ago

magic(5) means `man 5 magic`: https://linux.die.net/man/5/magic

It's just a tool that can read "magic bytes" to figure out what a file contains. Very different from what VectorVFS is.

yjftsjthsd-h • 4 days ago

https://manpages.org/magic/5 is a database of file types, used by the file(1) command. I don't exactly follow how it's the same though; it would let you say "what files are videos" but not "what files are videos of a cat". Which is sort of related but unless I missed something there is a difference.

lstodd • 4 days ago

four people answered strictly correctly as to what magic(5) is, but not a single one realized that storing some aux data as xattr in linux FS is not in any way different from just storing the exact same data as a file header. which is how magic(5) works.

how come?

(besides good luck not forgetting to rsync those xattrs)