Item 44187741

blagie • 2 days ago

This data cannot be anonymized. This is trivially provable mathematically, and given the type of data, it should also be intuitively obvious to even the most casual observer.

If you're talking to ChatGPT about being hunted by a Mexican cartel, and having escaped to your uncle's vacation home in Maine -- which is the sort of thing a tiny (but non-zero) minority of people ask LLMs about -- that's 100% identifying.

And if the Mexican cartel finds out, e.g. because the NY Times' law firm had a digital compromise, that means someone is dead.

Legally, I think NY Times is 100% right in this lawsuit holistically, but this is a move which may -- quite literally -- kill people.

zarzavat • 2 days ago

It's like anonymizing your diary by erasing your name on the cover.

JKCalhoun • 2 days ago

I don't dispute your example, but I suspect there is a non-zero number of cases that would not be so extreme, so obviously identifiable.

So, sure, no panacea, but .. why not for the cases where it would be a barrier?

blagie • 22 hours ago

Because such cases don't really exist.

Your text used an unusual double ellipsis (" .. " instead of "... "), uncommon (not rare) generative vocabulary ("panacea"), etc. Statistics on those allow for pretty good re-identification.

Ditto for the times you do things and your work schedule.

Etc.
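
To make the stylometry point concrete, here is a rough sketch of how punctuation habits and word choice could be turned into a fingerprint and matched against known writing samples. The marker list, the function names, and the cosine-similarity matching are all assumptions chosen for illustration; real attacks would use far larger feature sets and much more text.

```python
import math
import re
from collections import Counter

# Hypothetical marker list: function words and punctuation habits are
# typical stylometric features, but this particular selection is an assumption.
MARKERS = ["the", "of", "but", "so", "sure", "which", "..", "...", "--", ","]

def tokenize(text):
    # Words, runs of two or more dots, double hyphens, and commas.
    return re.findall(r"[a-z']+|\.{2,}|--|,", text.lower())

def profile(text):
    # Relative frequency of each marker in the text.
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = max(len(tokens), 1)
    return [counts[m] / total for m in MARKERS]

def cosine(a, b):
    # Cosine similarity between two feature vectors (0.0 if either is all zeros).
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def best_match(anonymous_text, known_samples):
    # Return the name whose known sample is stylistically closest.
    target = profile(anonymous_text)
    return max(
        known_samples,
        key=lambda name: cosine(target, profile(known_samples[name])),
    )
```

Calling `best_match(anonymous_text, {"alice": "...", "bob": "..."})` with a dict of name-to-sample text returns the closest candidate. Even this toy version separates a " .. " writer from a " -- " writer; combining it with timestamps only narrows things further.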

It's not "obviously identifiable," but then a buffer overflow is not "obviously exploitable" either. It takes one very, very expert individual to write a script, and after that everyone can exploit it.

Ditto here.