xnorswap 4 days ago

I wonder how much accuracy would be improved by expanding from single words to the most common pairs or n-tuples.

You would need more computation to hash, but I bet adding the frequencies of the top 50 word pairs and the top 20 most common 3-tuples would be a strong signal.
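
Something like the following rough sketch, with hypothetical TOP_PAIRS / TOP_TRIPLES reference lists computed once over the whole corpus (this isn't antirez's code, just how I picture the extra components):

    from collections import Counter

    def ngrams(words, n):
        # Consecutive n-tuples of words, e.g. n=2 gives word pairs.
        return list(zip(*(words[i:] for i in range(n))))

    def ngram_features(words, top_pairs, top_triples):
        pair_counts = Counter(ngrams(words, 2))
        triple_counts = Counter(ngrams(words, 3))
        total_pairs = max(sum(pair_counts.values()), 1)
        total_triples = max(sum(triple_counts.values()), 1)
        # Relative frequency of each reference n-gram in this user's text.
        pair_vec = [pair_counts[p] / total_pairs for p in top_pairs]
        triple_vec = [triple_counts[t] / total_triples for t in top_triples]
        return pair_vec + triple_vec  # 50 + 20 = 70 extra components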

( Not that the accuracy isn't already good, of course. I am indeed user eterm. I think I've said on this account or that one before that I don't sync passwords, so the two accounts are simply different machines that I use. I try not to cross-contribute or double-vote. )

antirez 4 days ago

Maybe there isn't enough data for each user for pairs, but I thought about mixing the two approaches (though I had no time to do it), that is, to have 350 components like now, for the single-word frequencies, plus another 350 for the most common pair frequencies. In this way part of the vector would remain a strong enough signal even for users with comparably less data.
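
Roughly something like this (just a sketch, assuming hypothetical reference lists TOP_WORDS and TOP_PAIRS of 350 entries each; each half is normalized on its own, so sparse pair data can't drown out the word signal):

    from collections import Counter

    def mixed_vector(words, top_words, top_pairs):
        word_counts = Counter(words)
        pair_counts = Counter(zip(words, words[1:]))
        w_total = max(sum(word_counts[w] for w in top_words), 1)
        p_total = max(sum(pair_counts[p] for p in top_pairs), 1)
        word_part = [word_counts[w] / w_total for w in top_words]  # 350 components
        pair_part = [pair_counts[p] / p_total for p in top_pairs]  # 350 components
        return word_part + pair_part                               # 700 components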

xnorswap 3 days ago

I've been thinking some more about this, and it occurred to me that you'd want to encode sentence boundaries as a pseudo-word in the n-tuples.

I then realised that "[period] <word>" would likely dominate the most common pairs, and that a lot of time could be saved by simply recording the first words of sentences as their own vector set, in addition to, but separate from, the regular word vector.

Whether this would be a stronger or weaker signal per vector space than the tail of words in the regular common-words vector, I don't know.
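
For reference, a quick sketch of what I mean, with a deliberately crude sentence splitter (split on ".", "!" and "?") and a hypothetical TOP_FIRST_WORDS list:

    import re
    from collections import Counter

    def first_word_vector(text, top_first_words):
        # Split on simple sentence-ending punctuation; crude, but enough
        # to collect the first word of each sentence.
        sentences = re.split(r"[.!?]+", text.lower())
        firsts = Counter()
        for s in sentences:
            words = s.split()
            if words:
                firsts[words[0]] += 1
        total = max(sum(firsts.values()), 1)
        # A separate, small vector of first-word frequencies, kept apart
        # from the regular common-words vector.
        return [firsts[w] / total for w in top_first_words]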