Item 43707296

Well, well, well, cocktailpeanuts. :spiderman_pointing:

I suspect, antirez, that you may have greater success removing some of the most common English words in order to find truly suspicious correlations in the data.

cocktailpeanuts and I, for example, share words like:

because, people, you're, don't, they're, software, that, but, you, want

Unfortunately, this is a forum where people will use words like "because," "people," and "software."

Because, well, people here talk about software.

<=^)

Edit: Neat work, nonetheless.

cratermoon • 4 days ago

I noted the "analyze" feature didn't seem as useful as it could be because the majority of the words are common articles and conjunctions. I'd like to see a version of analyze that filters out at least the following stop words: a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not, of, on, or, such, that, the, their, then, there, these, they, this, to, was, will, with
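A minimal sketch of that kind of filter, assuming the word counts come from plain comment text. The tokenizer regex and the `top_words` helper are hypothetical and are not part of the actual tool; only the stop-word list is taken from the comment above.

```python
import re
from collections import Counter

# Stop-word list copied from the comment above.
STOP_WORDS = frozenset(
    "a an and are as at be but by for if in into is it no not of on or such "
    "that the their then there these they this to was will with".split()
)

def top_words(text, n=10, stop_words=STOP_WORDS):
    """Return the n most frequent words in text, skipping stop words."""
    # Naive tokenizer (an assumption): lowercase runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in stop_words)
    return counts.most_common(n)
```

Calling `top_words(comment_text, 20)` would then list only the content words, which is closer to what I was hoping "analyze" would show.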

1 reply

antirez • 4 days ago

The system uses on purpose those simple words, since they are "tellers" of the style of the user in a context independent way. Burrows' papers explain this very well, but in general we want to capture low-level structure, more than topics and the exact non-obvious words used. I tested the system with 10k words and with the most common words removed, and you get totally different results (still useful, but not style matching); basically, you get users grouped by interests.
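For readers who have not seen the idea, here is a rough sketch of a Burrows-style Delta computation. It is not the code the system actually runs: the function names, the tokenizer, and the use of the population standard deviation are all assumptions made for illustration.

```python
import re
from collections import Counter
from statistics import mean, pstdev

def rel_freqs(text, vocab):
    """Relative frequency of each vocab word in text."""
    words = re.findall(r"[a-z']+", text.lower())  # naive tokenizer (assumption)
    counts = Counter(words)
    total = len(words) or 1
    return [counts[w] / total for w in vocab]

def burrows_delta(texts, vocab):
    """texts: dict of user -> text. Returns dict of (user_a, user_b) -> delta.

    Lower delta means the two users use the vocab words in more similar
    proportions.
    """
    freqs = {u: rel_freqs(t, vocab) for u, t in texts.items()}
    users = list(freqs)
    # Z-score each word's frequency across all users.
    z = {u: [] for u in users}
    for i in range(len(vocab)):
        col = [freqs[u][i] for u in users]
        mu, sigma = mean(col), pstdev(col)
        for u in users:
            z[u].append((freqs[u][i] - mu) / sigma if sigma else 0.0)
    # Delta is the mean absolute difference of z-scores for each pair.
    deltas = {}
    for i, a in enumerate(users):
        for b in users[i + 1:]:
            deltas[(a, b)] = mean(abs(x - y) for x, y in zip(z[a], z[b]))
    return deltas
```

Here `vocab` would be the most frequent words in the whole corpus (the "simple" words), so the comparison depends on how someone writes rather than on what they write about.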

2 replies

cratermoon • 4 days ago

>The system uses on purpose those simple words, since they are "tellers" of the style of the user in a context independent way.

Yes, that's good! I didn't state my interest clearly, though. I'd like to see the "analyze" result with the stop words excluded, not for the style comparison part, but for the reasons you state and others.

yorwba • 4 days ago

I think grouping users by interests would be a more interesting application. Most users don't have multiple accounts, but everyone probably shares some interests with other users, whom they might enjoy discovering.
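A rough sketch of what that could look like, assuming each user's comments are concatenated into one string. The `COMMON` list, `content_vector`, and `cosine` are hypothetical names for illustration, and cosine similarity over content-word counts is just one plausible choice, not something the tool is known to do.

```python
import re
from collections import Counter
from math import sqrt

# Small illustrative list of common words to ignore (an assumption).
COMMON = frozenset(
    "a an and are as at be but by for if in is it of on or that the this to was with".split()
)

def content_vector(text):
    """Count the words in text that are not in COMMON."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in COMMON)

def cosine(u, v):
    """Cosine similarity between two word-count vectors (0.0 if either is empty)."""
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    norm = sqrt(sum(c * c for c in u.values())) * sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0
```

Ranking other users by `cosine(content_vector(mine), content_vector(theirs))` would surface people who talk about the same topics, regardless of writing style.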

2 replies

SchemaLoad • 4 days ago

Pretty sure the point here is to demonstrate how governments or other surveillance orgs can easily find your alt accounts even if you use Tor or any number of security tools.

alganet • 4 days ago

There are already plenty of those running everywhere.

alganet • 4 days ago

That seems to be a misconception.

The usage frequency of simple words is a powerful tell.

2 replies

andrewmcwatters • 4 days ago

I can understand the nuance of your assertion, but the data returned in these results suggests it's not really all that powerful.

Apparently so many people write like me that simple language seems more like a way to mask yourself in a crowd.

1 reply

alganet • 4 days ago

You can definitely mask writing style. Whether you can do that only by using simple words, I am not so sure.

cratermoon • 4 days ago

Indeed, some writing styles make frequent use of words like "that" and "just".