perone 4 days ago

Hi, I'm not sure I understood what you meant by opaque embeddings either, but whether a file surfaces or not comes down to its similarity score (which is basically the dot product of the query embedding and the file's embedding).
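
To make the scoring concrete, here is a rough sketch of dot-product ranking. It is not the tool's actual code: the function names, the toy 3-dimensional vectors, the `threshold` parameter, and the unit-length normalization step (which turns the dot product into cosine similarity) are all assumptions for illustration.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale v to unit length so the dot product equals cosine similarity.
    (Assumption: the thread does not say whether embeddings are normalized.)"""
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]

def rank_files(query_vec, file_vecs, threshold=0.0):
    """Return (path, score) pairs with score >= threshold, best match first."""
    q = normalize(query_vec)
    scored = [(path, dot(q, normalize(vec))) for path, vec in file_vecs.items()]
    hits = [(path, score) for path, score in scored if score >= threshold]
    return sorted(hits, key=lambda item: item[1], reverse=True)

# Toy vectors; real embeddings typically have hundreds of dimensions.
files = {
    "notes.txt": [0.9, 0.1, 0.0],
    "photo.jpg": [0.0, 0.2, 0.9],
}
results = rank_files([1.0, 0.0, 0.0], files, threshold=0.5)
print(results)  # only notes.txt clears the threshold
```

A file "surfaces" here only when its score clears the threshold, and the score by itself carries no explanation of why the two vectors ended up close.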

jlhawn 4 days ago

How much work do you think it would be to also have a separate xattr which holds a human-readable description of the file contents? I wonder if that might already be an intermediate product of some of the embedding tools, like "arbitrary media" -> "text description of media" -> "embedding vector". You could store both of those as xattrs, and you could debug by comparing your text query with the text description of the file contents, as they should produce similar embedding vectors. You could even audit any file, assuming you know what its contents are, by checking the text description xattr generated by this program.
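
Something like the following is what I have in mind, as a minimal sketch only. The attribute names `user.example.embedding` and `user.example.description` are hypothetical (I don't know what names the tool actually uses), the float32 serialization is my own choice, and `os.setxattr` / `os.getxattr` are only available on Linux.

```python
import os
import struct

# Hypothetical attribute names; the tool's real names are not stated in this thread.
EMBEDDING_ATTR = "user.example.embedding"
DESCRIPTION_ATTR = "user.example.description"

def pack_vector(vec):
    """Serialize a list of floats as little-endian float32 bytes."""
    return struct.pack(f"<{len(vec)}f", *vec)

def unpack_vector(blob):
    """Inverse of pack_vector: float32 bytes back to a list of floats."""
    return list(struct.unpack(f"<{len(blob) // 4}f", blob))

def write_attrs(path, vec, description):
    """Store the embedding and its text description on the file (Linux only)."""
    os.setxattr(path, EMBEDDING_ATTR, pack_vector(vec))
    os.setxattr(path, DESCRIPTION_ATTR, description.encode("utf-8"))

def read_attrs(path):
    """Return (vector, description) previously stored by write_attrs."""
    vec = unpack_vector(os.getxattr(path, EMBEDDING_ATTR))
    description = os.getxattr(path, DESCRIPTION_ATTR).decode("utf-8")
    return vec, description
```

Auditing a file would then just be `read_attrs(path)` and reading the description next to the file you already know. Size limits for xattr values vary by filesystem, so a large vector may not fit everywhere.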

b0a04gl 4 days ago

Hi, what I meant is that the embeddings generated here are not structured, labeled, or tagged, so the dot product only tells you 'this file chunk is similar to the user's query' rather than 'this file chunk contains content about the user's query'. It's similar to any thinking model's reasoning trace. I understand that's not in scope here, but it made me realise that this is a limitation of most vector stores, and I foresee it being achievable in your implementation.