Well, well, well, cocktailpeanuts. :spiderman_pointing:
I suspect, antirez, that you may have greater success removing some of the most common English words in order to find truly suspicious correlations in the data.
cocktailpeanuts and I, for example, share words like:
because, people, you're, don't, they're, software, that, but, you, want
Unfortunately, this is a forum where people will use words like "because," "people," and "software."
Because, well, people here talk about software.
<=^)
Edit: Neat work, nonetheless.
I noted the "analyze" feature didn't seem as useful as it could be because the majority of the words are common articles and conjunctions. I'd like to see a version of analyze that filters out at least the following stop words: a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not, of, on, or, such, that, the, their, then, there, these, they, this, to, was, will, with
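For what it's worth, a stop-word-filtered "analyze" could look roughly like the sketch below. This is not the tool's actual code: `top_words` is a hypothetical helper, the tokenizer is a simple regex chosen for illustration, and the stop list is just the one proposed above.

```python
import re
from collections import Counter

# Stop words copied from the list proposed in the comment above.
STOP_WORDS = frozenset("""
a an and are as at be but by for if in into is it no not of on or such
that the their then there these they this to was will with
""".split())

def top_words(text, n=10, stop_words=STOP_WORDS):
    """Return the n most frequent words in text, skipping stop words.

    Hypothetical helper for illustration; the tokenizer is deliberately naive.
    """
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in stop_words)
    return counts.most_common(n)

sample = "The software is not the problem, but people want the software to be free."
print(top_words(sample, n=3))
```

With the stop words gone, what remains is mostly topical vocabulary, which is the part of the "analyze" output that says something about interests rather than style.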
The system uses those simple words on purpose, since they are "tells" of the user's style in a context-independent way. Burrows' papers explain this very well, but in general we want to capture low-level structure rather than topics and the exact non-obvious words used. I tested the system with 10k words and with the most common words removed, and you get totally different results (still useful, but not style matching): basically, you get users grouped by interests.
>The system uses those simple words on purpose, since they are "tells" of the user's style in a context-independent way.
Yes, that's good! I didn't state my interest clearly, though. I'd like to see the "analyze" result with the stop words excluded, not for the style comparison part, but for the reasons you state and others.
I think grouping users by interests would be a more interesting application. Most users don't have multiple accounts, but everyone probably shares some interests with other users, whom they might enjoy discovering.
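A rough sketch of what interest matching might look like, assuming each user's comments are concatenated into one string. The usernames, the sample texts, the short stop list, and the helpers `content_vector`, `cosine`, and `most_similar` are all made up for illustration; cosine similarity over word counts is just one simple choice, not something the tool is known to use.

```python
import math
import re
from collections import Counter

# A short illustrative stop list; a real one would be much longer.
STOP_WORDS = frozenset(
    "a an and are as at be but by for if in is it of on or that the to was with i my".split()
)

def content_vector(text):
    """Count the words in text that are not stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)

def cosine(a, b):
    """Cosine similarity between two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def most_similar(user, comments):
    """Return the other user whose content words overlap most with user's."""
    target = content_vector(comments[user])
    scored = [
        (cosine(target, content_vector(text)), name)
        for name, text in comments.items()
        if name != user
    ]
    return max(scored)[1]

# Hypothetical users and comments.
comments = {
    "alice": "I rewrote the parser in Rust and the compiler is faster",
    "bob": "The Rust compiler caught a bug in my parser",
    "carol": "Sourdough bread needs a long slow rise in the fridge",
}
print(most_similar("alice", comments))  # prints: bob
```

This is the mirror image of the style comparison: the common words are thrown away and only the topical ones are compared.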
Pretty sure the point here is to demonstrate how governments or other surveillance orgs can easily find your alt accounts even if you use Tor or any number of security tools.
That seems to be a misconception.
The usage frequency of simple words is a powerful tell.
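To make that concrete, here is a minimal sketch in the spirit of Burrows' Delta: take the relative frequency of a handful of common words per author, z-score each word across authors, and average the absolute differences. This is not the tool's implementation. The function names, the five-word vocabulary, and the toy texts are assumptions, and using the population standard deviation is just one reasonable choice.

```python
import re
from collections import Counter
from statistics import mean, pstdev

def rel_freqs(text, vocab):
    """Relative frequency of each vocab word in text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words) or 1  # avoid dividing by zero on empty text
    return [counts[w] / total for w in vocab]

def delta_scores(texts, vocab):
    """Return {(a, b): delta} for each pair of authors; lower means more similar."""
    names = sorted(texts)
    freqs = {n: rel_freqs(texts[n], vocab) for n in names}
    z = {n: [] for n in names}
    for i in range(len(vocab)):
        column = [freqs[n][i] for n in names]
        mu, sigma = mean(column), pstdev(column)
        for n in names:
            # A word used identically by everyone carries no signal.
            z[n].append((freqs[n][i] - mu) / sigma if sigma else 0.0)
    return {
        (a, b): mean(abs(x - y) for x, y in zip(z[a], z[b]))
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }

# Toy data: hypothetical accounts, far too short for real stylometry.
texts = {
    "main": "I just think that people want software that just works",
    "alt": "you just know that the tool that they built just works",
    "other": "the cat sat on the mat and the dog sat on the rug",
}
vocab = ["that", "just", "the", "and", "on"]
scores = delta_scores(texts, vocab)
print(min(scores, key=scores.get))
```

On these toy texts the two "that"/"just"-heavy accounts come out as the closest pair. With real data you would want thousands of words per author and a much larger vocabulary of frequent words.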
I can understand the nuance of your assertion, but the data returned by these results suggests it's not really all that powerful.
Apparently so many people write like me that simple language seems more like a way to mask yourself in a crowd.
You can definitely mask writing style. Whether you can do that only by using simple words, I am not so sure.
Indeed, some writing styles make frequent use of words like "that" and "just".