formerly_proven 4 days ago

I'm surprised no one has made this yet with a clustered visualization.

PaulHoule 4 days ago

Personally I like this approach a lot

https://scikit-learn.org/stable/modules/generated/sklearn.ma...

I think other methods are more fashionable today

https://scikit-learn.org/stable/modules/manifold.html

particularly multidimensional scaling, but personally I think t-SNE plots are less pathological (they don't have as many of those crazy cusps that make me think it's projecting down from a higher-dimensional surface that is nearly parallel to the page)
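
In scikit-learn terms it's only a couple of lines; a minimal sketch, with X standing in for real document embeddings:

    # Minimal sketch: project high-dimensional document embeddings to 2D with t-SNE.
    # X is a placeholder for an (n_docs, n_dims) embedding matrix; random data here.
    import numpy as np
    from sklearn.manifold import TSNE

    X = np.random.rand(200, 384)
    X_2d = TSNE(n_components=2, perplexity=30, init="pca",
                random_state=0).fit_transform(X)
    # X_2d is (n_docs, 2) and can be scatter-plotted directly.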

After processing documents with BERT I really like the clusters generated by the simple and old k-Means algorithm

https://scikit-learn.org/stable/modules/generated/sklearn.cl...

It has the problem that it always finds 20 clusters if you set k=20, so something that really ought to be one big cluster might get treated as three little clusters, but the clusters I get from it reflect the way I see things.
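
Roughly what I mean, as a sketch (X is placeholder data standing in for BERT embeddings):

    # Minimal sketch: cluster BERT-style document embeddings with plain k-Means.
    import numpy as np
    from sklearn.cluster import KMeans

    X = np.random.rand(500, 768)        # placeholder for real BERT embeddings
    km = KMeans(n_clusters=20, n_init=10, random_state=0).fit(X)
    labels = km.labels_                 # exactly 20 cluster ids, whether or not 20 is "natural"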

antirez 4 days ago

Redis supports random projection to a lower dimensionality, but the reality is that projecting a 350d vector into 2d is nice but does not remotely capture the "reality" of what is going on. Still, it is a nice idea to try some time. However, I would do that with more than the 350 top words: when I used 10k words it captured interests much more strongly than style, so a 2D projection of that is going to be much more interesting, I believe.
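
To illustrate the general idea only (this is not the Redis implementation), a random projection is just a multiplication by a random matrix:

    # Sketch of random projection: multiply 350d vectors by a random Gaussian
    # matrix to get 2d points. Placeholder data, not Redis internals.
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.random((1000, 350))     # placeholder 350d fingerprint vectors
    R = rng.normal(size=(350, 2))   # random projection matrix
    X_2d = X @ R                    # (1000, 2); at only 2 target dims a lot of structure is lost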

layer8 4 days ago

Given that some matches are “mutual” and others are not, I don’t see how that could translate to a symmetric distance measure.

antirez 4 days ago

Imagine a 2D space: it has the same property!

You have three points nearby, and a fourth a bit more distant. Point 4's best match is point 1, but point 1's best matches are points 2 and 3.
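
A tiny numeric sketch of the same thing, with plain Euclidean distance and made-up coordinates:

    import numpy as np

    # four 2D points: 1, 2, 3 close together, 4 farther away
    pts = np.array([[0.0, 0.0],    # 1
                    [-0.1, 0.0],   # 2
                    [0.0, -0.1],   # 3
                    [1.0, 0.0]])   # 4
    d = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    print(d.argmin(axis=1) + 1)    # -> [2 1 1 1]: 4's best match is 1, but 1's is 2 (tied with 3)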

layer8 4 days ago

Good point, but the similarity score between mutual matches is still different, so it doesn’t seem to be a symmetric measure?

antirez 4 days ago

Your observation is really acute: the small difference is due to quantization. When we search for element A, which is int8-quantized by default, the code path de-quantizes it, then re-quantizes it and searches. This produces a small loss of precision, like this:

    redis-cli -3 VSIM hn_fingerprint ELE pg WITHSCORES | grep montrose
    montrose 0.8640020787715912

    redis-cli -3 VSIM hn_fingerprint ELE montrose WITHSCORES | grep pg
    pg 0.8639097809791565

So while cosine similarity is commutative, the quantization steps lead to a slightly different result. But the difference is 0.000092, which in practical terms is not important. Redis can use non-quantized vectors with the NOQUANT option in VADD, but this makes the vector elements use 4 bytes per component: given that the recall difference is minimal, it is almost always not worth it.
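
A rough way to see the size of the effect (this just illustrates int8 quantization error on a cosine similarity, it is not the actual Redis code path):

    import numpy as np

    rng = np.random.default_rng(42)
    a, b = rng.normal(size=350), rng.normal(size=350)

    def int8_roundtrip(v):
        scale = np.abs(v).max() / 127.0     # symmetric per-vector scale, an assumption for the sketch
        return np.round(v / scale).astype(np.int8).astype(np.float32) * scale

    def cos(x, y):
        return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

    print(cos(a, b))                                  # exact similarity
    print(cos(int8_roundtrip(a), int8_roundtrip(b)))  # after int8 quantization
    # The two values are nearly identical; the small gap is quantization noise
    # of the same kind as the 0.000092 difference between pg and montrose above.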