tptacek 4 days ago

This is an interesting and well-written post but the data in the app seems pretty much random.

antirez 4 days ago

Thank you, tptacek. Thanks to the Internet Archive's cached copy of the "pg" results from the post of three years ago, I was able to verify that the entries are quite similar in the case of "pg". Consider that the technique captures just the statistical patterns of very common words, so you are not likely to see users that you believe are "similar" to yourself. Notably, montrose may really be a secondary account of PG, and it was also found as a cross-reference in the original work of three years ago.

Also note that similarity rankings are not reciprocal: A can have B as its top-scoring match, while B may have many other items much nearer to it than A. Think of a 2D space with a cluster of points and one more point nearby but a bit apart from the cluster.
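To make the non-reciprocity concrete, here is a minimal sketch using plain Euclidean distance in 2D. The point names and coordinates are made up for illustration, and this is not the code used in the post.

```python
import math

# Hypothetical points: a tight cluster (a, b, c) plus one point set apart.
points = {
    "a": (0.0, 0.0),
    "b": (0.1, 0.0),
    "c": (0.0, 0.1),
    "outlier": (1.0, 0.0),
}

def neighbors(name, pts):
    """Return the other point names, sorted from nearest to farthest."""
    origin = pts[name]
    others = [n for n in pts if n != name]
    return sorted(others, key=lambda n: math.dist(origin, pts[n]))

outlier_ranking = neighbors("outlier", points)
b_ranking = neighbors("b", points)

print("nearest to outlier:", outlier_ranking)
print("nearest to b:      ", b_ranking)
```

The top match for "outlier" is a cluster member, but from that member's point of view every other cluster member is closer than "outlier".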

Unfortunately, I don't think this technique works very well for actually discovering duplicate accounts, because people often post just a few comments from fake accounts, so there is not enough data. The exception is when someone consistently uses another account to cover their identity.

EDIT: at the end of the post I added the visual representations of pg and montrose.

PaulHoule 4 days ago

If you want to do document similarity ranking in general, finding nearby points in word frequency space works, but not as well as (1) applying an autoencoder or another dimensionality reduction technique to the vectors, or (2) running a BERT-like model and pooling over the documents [1].
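For reference, the word-frequency baseline can be sketched in a few lines of standard-library Python. The documents, the tokenizer regex, and the helper names below are made up for illustration; options (1) and (2) would replace `word_freq` with a learned representation.

```python
import math
import re
from collections import Counter

def word_freq(text):
    """Map each lowercase word to its relative frequency in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words)
    return {w: c / total for w, c in Counter(words).items()}

def cosine(u, v):
    """Cosine similarity between two sparse word-frequency vectors."""
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical corpus: two documents on one topic, one unrelated.
docs = {
    "patent_a": "a method for coating a metal surface with a polymer film",
    "patent_b": "a process for covering a steel plate with a plastic layer",
    "recipe": "mix the flour and the sugar then bake the cake",
}

query = word_freq("coating a metal surface with polymer")
vectors = {name: word_freq(text) for name, text in docs.items()}
ranked = sorted(docs, key=lambda n: cosine(query, vectors[n]), reverse=True)
print(ranked)
```

Note that patent_b covers the same topic with synonyms, yet it only matches the query on filler words like "a" and "with". That gap is what a learned representation is supposed to close.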

I worked on a search engine for patents that used the first approach. Our evaluations showed it was much better than other patent search engines, and we had no trouble selling it because customers could feel the difference in demos.

I tried dimensionality reduction on the BERT vectors, and in every case I tried it made relevance worse. (BERT has already learned a lot, which is being thrown away, and there isn't more to learn from my particular documents.)

I don't think either of these helps with the "finding articles authored by the same person" problem, because there one assumes the same person always uses the same words, whereas documents about the same topic use synonyms that will be turned up by (1) and (2). There is a large literature on determining authorship based on style:

https://en.wikipedia.org/wiki/Stylometry

[1] With https://sbert.net/ this is so easy.

antirez 4 days ago

Indeed, but my problem is: all those vector databases (including Redis!) are always thought of as useful in the context of learned embeddings, BERT, CLIP, ... But I really wanted to show that vectors are very useful and interesting outside that space. Now, I also like encoders very much, but I have the feeling that Vector Sets, as a data structure, need to be presented as a general tool. So I really cherry-picked a use case that I liked and where neural networks were not present. Btw, Redis Vector Sets natively support dimensionality reduction by random projection, for the case where the vector is too redundant. Yet, in my experiments, I found that binary quantization (also supported) is a better way to save CPU/space compared to RP.
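To give an idea of what the two options do, here is a rough pure-Python sketch. It does not call Redis and does not claim to mirror its internals: the helper names, the Gaussian projection matrix, the dimensions, and the synthetic vectors are all assumptions made for illustration.

```python
import math
import random

def random_projection_matrix(in_dim, out_dim, seed=0):
    """Hypothetical helper: Gaussian random matrix mapping in_dim -> out_dim."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(out_dim)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(in_dim)]
            for _ in range(out_dim)]

def project(matrix, vec):
    """Reduce a vector by multiplying it with the random matrix."""
    return [sum(m * x for m, x in zip(row, vec)) for row in matrix]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def binarize(vec):
    """Binary quantization: keep one bit per component (its sign)."""
    return [1 if x > 0 else 0 for x in vec]

def bit_agreement(a, b):
    """Fraction of positions where two bit vectors agree."""
    return sum(1 for x, y in zip(a, b) if x == y) / len(a)

# Synthetic data: a base vector, a noisy copy of it, and an unrelated vector.
rng = random.Random(42)
dim = 256
base = [rng.gauss(0.0, 1.0) for _ in range(dim)]
near = [x + rng.gauss(0.0, 0.1) for x in base]
far = [rng.gauss(0.0, 1.0) for _ in range(dim)]

# Option 1: random projection from 256 down to 32 float components.
matrix = random_projection_matrix(dim, 32)
rp_near = cosine(project(matrix, base), project(matrix, near))
rp_far = cosine(project(matrix, base), project(matrix, far))

# Option 2: keep all 256 dimensions, but store 1 bit for each.
bin_near = bit_agreement(binarize(base), binarize(near))
bin_far = bit_agreement(binarize(base), binarize(far))

print("random projection:", rp_near, rp_far)
print("binary quantization:", bin_near, bin_far)
```

With these made-up sizes, the projected vector takes 32 floats (128 bytes at 32 bits each), while the binarized one takes 256 bits (32 bytes) and can be compared with cheap bit operations. Both keep the noisy copy clearly closer than the unrelated vector here; how they compare on real data is something to measure, not something this toy shows.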