If you want to do document similarity ranking in general, finding nearby points in word frequency space works, but not as well as: (1) applying an autoencoder or another dimensionality reduction technique to those vectors, or (2) running a BERT-like model and pooling over the documents [1].
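A minimal sketch of (1), assuming scikit-learn and a made-up toy corpus; here LSA (TruncatedSVD over TF-IDF) stands in for the autoencoder or whatever reduction technique you prefer:

```python
# Approach (1): reduce word-frequency vectors, then rank by nearest neighbors.
# Sketch only; TruncatedSVD over TF-IDF (LSA) stands in for any reduction technique.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "A method for encoding video streams with low latency.",
    "Video compression apparatus using motion estimation.",
    "A chemical process for synthesizing polymers.",
]

tfidf = TfidfVectorizer().fit_transform(docs)                  # word frequency space
reduced = TruncatedSVD(n_components=2).fit_transform(tfidf)    # dimensionality reduction

# Rank all documents by similarity to the first one, in the reduced space.
scores = cosine_similarity(reduced[:1], reduced)[0]
print(scores.argsort()[::-1], scores)
```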
I worked on a patent search engine that used the first approach; our evaluations showed it was much better than other patent search engines, and we had no trouble selling it because customers could feel the difference in demos.
I tried dimensionality reduction on the BERT vectors, and in every case I tried it made relevance worse. (BERT has already learned a lot, which the reduction throws away; there isn't more to learn from my particular documents.)
I don't think either of these helps with "finding articles authored by the same person", because that assumes the same person always uses the same words, whereas documents about the same topic use synonyms, and those are exactly what (1) and (2) turn up. There is a big literature on determining authorship based on style:
https://en.wikipedia.org/wiki/Stylometry
[1] With https://sbert.net/ this is so easy.
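For example, a minimal sketch of (2); the model name (all-MiniLM-L6-v2) and the toy documents are just illustrative choices:

```python
# Approach (2): a BERT-like model with pooling, via sentence-transformers (https://sbert.net/).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # small pooled encoder; pick any model

docs = [
    "A method for encoding video streams with low latency.",
    "Video compression apparatus using motion estimation.",
    "A chemical process for synthesizing polymers.",
]
embeddings = model.encode(docs, convert_to_tensor=True)

# Rank all documents by cosine similarity to the first one.
scores = util.cos_sim(embeddings[:1], embeddings)[0]
print(scores.argsort(descending=True), scores)
```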
Indeed, but my problem is this: all those vector databases (including Redis!) are always thought of as useful only in the context of learned embeddings, BERT, CLIP, ... But I really wanted to show that vectors are very useful and interesting outside that space. Now, I like encoders a lot too, but I have the feeling that Vector Sets, as a data structure, need to be presented as a general tool. So I cherry-picked a use case that I liked and where neural networks were not present.

Btw, Redis Vector Sets natively support dimensionality reduction by random projection, for the case where the vector is too redundant. Yet, in my experiments, I found that binary quantization (also supported) is a better way to save CPU/space than RP.
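For reference, a minimal sketch of both options via redis-py; the key names (docs:rp, docs:bin), the element name, and the toy 8-dimensional vector are made up, and the VADD/VSIM options are sent raw (REDUCE for random projection, BIN for binary quantization) following the Vector Sets documentation:

```python
# Sketch: Redis Vector Sets with random projection (REDUCE) vs. binary quantization (BIN).
# Assumes a Redis server with Vector Sets; commands are issued raw via execute_command.
import redis

r = redis.Redis()

vec = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]  # toy 8-dimensional vector

# Random projection down to 4 dimensions at insertion time.
r.execute_command("VADD", "docs:rp", "REDUCE", 4, "VALUES", 8, *vec, "doc:1")

# Binary quantization instead: keep the full dimensionality, store 1 bit per component.
r.execute_command("VADD", "docs:bin", "VALUES", 8, *vec, "doc:1", "BIN")

# Query either set for the elements most similar to a given vector.
print(r.execute_command("VSIM", "docs:bin", "VALUES", 8, *vec, "COUNT", 5))
```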