Not only does this mean OpenAI will have to retain this data on their servers, they could also be ordered to share it during discovery with the legal teams of the companies suing them (which is the entire point of a legal hold). Some law firm representing NYT could soon be reading out your private conversations with ChatGPT in a courtroom to prove their case.
My guess is they will store them on tape, e.g. on something like a Spectra TFinity ExaScale library. I assume AWS Glacier et al. use this sort of thing for their deep archives.
Storing them on something with an hours-to-days retrieval window satisfies the court order, is cheaper, and makes me as a customer that little bit more content with it (a mass data breach would take months of plundering and be easily detectable).
Glacier is tape silos, but this is textual data. You don't need to save output images, just the checkpoint+hash of the generating model and the seed; Stable Diffusion saves this in the metadata until you manually delete it, for example. So my argument is you could do this with LTO as well. Text compresses well, especially if you don't do it naively.
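To put a rough number on the "compresses well" point, here's a minimal sketch (Python standard-library zlib, made-up sample log) comparing naive per-message compression against compressing a whole batch at once:

```python
import zlib

# Hypothetical chat log: repetitive, English-heavy text compresses very well.
messages = [
    "User: How do I reset my password?",
    "Assistant: Go to Settings > Account > Reset Password and follow the prompts.",
] * 500

# Naive: compress each message on its own (per-message overhead, no shared context).
naive = sum(len(zlib.compress(m.encode())) for m in messages)

# Less naive: concatenate and compress the whole batch, so repeated phrasing deduplicates.
batched = len(zlib.compress("\n".join(messages).encode(), 9))

raw = sum(len(m.encode()) for m in messages)
print(f"raw: {raw} B, per-message: {naive} B, batched: {batched} B")
```

On repetitive logs the batched figure is dramatically smaller, which is the "don't do it naively" part; a real archive would presumably use something like zstd with a trained dictionary, but the principle is the same.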
> She suggested that OpenAI could have taken steps to anonymize the chat logs but chose not to
That is probably the solution right there.
This data cannot be anonymized. This is trivially provable mathematically, and given the type of data, it should also be intuitively obvious to even the most casual observer.
If you're talking to ChatGPT about being hunted by a Mexican cartel, and having escaped to your Uncle's vacation home in Maine -- which is the sort of thing a tiny (but non-zero) minority of people ask LLMs about -- that's 100% identifying.
And if the Mexican cartel finds out, e.g. because NY Times had a digital compromise at their law firm, that means someone is dead.
Legally, I think NY Times is 100% right in this lawsuit holistically, but this is a move which may -- quite literally -- kill people.
I don't dispute your example, but I suspect there is a non-zero number of cases that would not be so extreme, so obviously identifiable.
So, sure, no panacea, but .. why not for the cases where it would be a barrier?
Because such cases don't really exist.
Your text used an unusual double ellipsis (" .. " instead of "... "), uncommon (but not rare) vocabulary ("panacea"), etc. Statistics on those allow for pretty good re-identification.
Ditto for times you do things and work schedule.
Etc.
It's not "obviously identifiable," but a buffer overflow is not "obviously exploitable." Rather, it takes a very, very expert individual to write a script before everyone can exploit it.
Ditto here.
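To illustrate how little it takes, here is a toy sketch of the kind of stylometric fingerprinting being described (the quirk list and sample texts are invented; a real attacker would use hundreds of features plus posting times):

```python
import math

# Toy stylometric fingerprint: counts of a few writing quirks.
QUIRKS = [" .. ", "panacea", "ditto", "etc."]

def fingerprint(text):
    t = text.lower()
    return [t.count(q) for q in QUIRKS]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

anonymized_chat = "Sure, no panacea, but .. ditto for the other cases, etc."
known_comment = "It's no panacea .. but why not where it would be a barrier?"
someone_else = "Here is my grandmother's recipe for sourdough starter."

print(cosine(fingerprint(anonymized_chat), fingerprint(known_comment)))  # clearly nonzero
print(cosine(fingerprint(anonymized_chat), fingerprint(someone_else)))   # 0.0, no shared quirks
```

Swap the hand-picked quirks for word n-grams and timing data and you get the "very expert individual writes the script once" situation: after that, matching an "anonymous" log to a known author is routine.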
AOL found out, and thus we all found out, that you can't anonymize certain things, web searches in that case. I used to have bookmarked some literature from maybe ten years ago that said (proved with math?) that any moderate collection of data from or by individuals that fits certain criteria is de-anonymizable, if not by itself, then with minimal extra data. I want to say that held even if, for instance, instead of changing all occurrences of genewitch to user9843711, every instance of genewitch was replaced with a different, unique id.
I apologize for not having cites or a better memory at this time.
> The root of this problem is the core problem with k-anonymity: there is no way to mathematically, unambiguously determine whether an attribute is an identifier, a quasi-identifier, or a non-identifying sensitive value. In fact, all values are potentially identifying, depending on their prevalence in the population and on auxiliary data that the attacker may have. Other privacy mechanisms such as differential privacy do not share this problem.
see also: https://en.wikipedia.org/wiki/Differential_privacy which purports to solve this; that is, the wiki says the only remaining attacks are side-channel attacks like errors in the algorithm or whatever.
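For context, the differential-privacy idea is to add calibrated noise to aggregate answers rather than releasing per-user records at all. A minimal sketch of the standard Laplace mechanism (toy count and epsilon chosen arbitrarily):

```python
import random

def laplace_noise(scale):
    # Difference of two iid exponentials is Laplace(0, scale).
    return random.expovariate(1 / scale) - random.expovariate(1 / scale)

def dp_count(true_count, epsilon, sensitivity=1.0):
    # Laplace mechanism: noise with scale = sensitivity / epsilon makes the released
    # count epsilon-differentially private, i.e. any single user's presence or
    # absence only slightly changes the output distribution.
    return true_count + laplace_noise(sensitivity / epsilon)

# Toy example: release "how many users asked ChatGPT about NYT articles?" with noise,
# instead of releasing the chat logs themselves.
print(dp_count(1234, epsilon=0.5))
```

The catch is that this protects aggregate statistics; it doesn't give you a privacy-preserving way to hand over individual conversations, which is what the preservation order is actually about.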
If you squint a little, this problem is closely related to oblivious transfer as well.
> She suggested that OpenAI could have taken steps to anonymize the chat logs but chose not to, only making an argument for why it "would not" be able to segregate data, rather than explaining why it "can’t."
Sounds like bullshit lawyer speak. What exactly is the difference between the two?
Not wanting to do something isn't the same thing as being unable to do something.
!define would
> Used to express desire or intent -- https://www.wordnik.com/words/would
!define cannot
> Can not ( = am/is/are unable to) -- https://www.wordnik.com/words/cannot
Who said anything about not wanting to?
"I will not be able to do this"
"I cannot do this"
There is no semantic or legal difference between the two, especially coming from a tech company. Stalling and wordplay are very common legal tactics when a side has no other argument.
The article is derived from the order, which is itself a short summary of conversations had in court.
https://cdn.arstechnica.net/wp-content/uploads/2025/06/NYT-v...
> I asked:
> > Is there a way to segregate the data for the users that have expressly asked for their chat logs to be deleted, or is there a way to anonymize in such a way that their privacy concerns are addressed... what’s the legal issue here about why you can’t, as opposed to why you would not?
> OpenAI expressed a reluctance for a "carte blanche, preserve everything request," and raised not only user preferences and requests, but also "numerous privacy laws and regulations throughout the country and the world that also contemplate these type of deletion requests or that users have these types of abilities."
A "reluctance to retain data" is not the same as "technically or physically unable to retain data". Judge decided OpenAI not wanting to do it was less important than evidence being deleted.
Disagree. There’s something about the “able” that implies a hindered routine ability to do something — you can otherwise do this, but something renders you unable.
“I won’t be able to make the 5:00 dinner.” -> You could normally come, but there’s another obligation. There’s an implication that if the circumstances were different, you might be able to come.
“I cannot make the 5:00 dinner.” -> You could not normally come. There’s a rigid reason for the circumstance, and there is no negotiating it.
If someone was in an accident that rendered them unable to walk, would you say they can or can not walk?
I’d just assume that any chat or API call you make to any cloud-based AI in the US will be discoverable from here on out.
If that’s too big a risk, it really is time to consider locally hosted LLMs.
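The bar for the local route is lower than it used to be. A minimal sketch, assuming you already have something like Ollama serving its default local HTTP API (the model name here is just an example):

```python
import json
import urllib.request

# Ollama listens on localhost:11434 by default; the prompt never leaves your machine.
def ask_local_llm(prompt, model="llama3"):
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

print(ask_local_llm("Summarize what a litigation hold requires a company to do."))
```

Nothing here is subject to someone else's retention policy, which is the whole point.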
That's always been the case for any of your data anywhere in any third party service of any kind, if it is relevant evidence in a lawsuit. Nothing specific to do with LLMs.
I ask again: why not anonymize the data? That way NYT/the court could see whether users are bypassing the paywall through ChatGPT while preserving privacy.
Even if I wrote it, I don't care if someone read out loud in public court "user <insert_hash_here> said: <insert nastiest thing you can think of here>"
You can't really anonymize the data if the conversation itself is full of PII.
I had colleagues chat with GPT, and they sent all kinds of identifying information to it.
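To make that concrete: even an aggressive scrubber that catches names, emails and phone numbers leaves contextual identifiers behind. A toy sketch (the regexes and sample chat are invented for illustration):

```python
import re

# Naive PII scrubber: emails, phone numbers, and a First Last name pattern.
PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "<EMAIL>"),
    (re.compile(r"\+?\d[\d\s().-]{7,}\d"), "<PHONE>"),
    (re.compile(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b"), "<NAME>"),
]

def scrub(text):
    for pattern, token in PATTERNS:
        text = pattern.sub(token, text)
    return text

chat = (
    "Hi, I'm Jane Mendoza (jane.m@example.com, +1 555 010 2244). I'm hiding from a "
    "cartel at my uncle's vacation home on the lake in Machias, Maine, the one with "
    "the red boathouse, and I teach third grade at the only school in town."
)

print(scrub(chat))
# The name, email and phone are gone, but "uncle's vacation home ... Machias, Maine"
# plus "only school in town" still narrows this down to one person.
```

Hashing the user ID only hides the account, not the story the user told about themselves.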