Courts have always had the power to compel parties to a current case to preserve evidence. (For example, this was an issue in the Google monopoly case, since Google employees were using chats set to erase after 24 hours.) That becomes an issue in the discovery phase, well after the defendant has an opportunity to file a motion to dismiss. So a case with no specific allegation of wrongdoing would already be dismissed.
The power does not extend to any of your hypotheticals, which are not about active cases. Courts do not accept cases on the grounds that some bad thing might happen in the future; the plaintiff must show some concrete harm has already occurred. The only thing different here is how much potential evidence OpenAI has been asked to retain.
> Courts have always had the power to compel parties to a current case to preserve evidence.
Not just that: even without a specific court order, parties to existing or reasonably anticipated litigation have a legal obligation to preserve evidence, and it attaches immediately. Courts tend to issue orders when a party presents reason to believe another party is out of compliance with that automatic obligation, or when there is a dispute over the extent of the obligation. (In this case, both factors seem to be in play.)
Lopez v. Apple (2024) seems to be a recent and useful example of this; my lay understanding is that Apple was found to have failed in its duty to switch from auto-deletion (even if that auto-deletion was contractually promised to users) to an evidence-preservation level of retention, immediately when litigation was filed.
https://codiscovr.com/news/fumiko-lopez-et-al-v-apple-inc/
https://app.ediscoveryassistant.com/case_law/58071-lopez-v-a...
Perhaps the larger lesson here is: if you don't want your service provider to end up being required to retain your private queries, there's really no way to guarantee it, and the only real mitigation is to choose a service provider who's less likely to be sued!
(Not a lawyer, this is not legal advice.)
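Not legal advice either, but for the engineers reading: in code terms, a "litigation hold" is just a rule that outranks both your retention policy and user delete requests. Here's a minimal sketch of the idea, with invented names and a made-up 24-hour window; real hold tooling is far more involved:

    from dataclasses import dataclass
    from datetime import datetime, timedelta, timezone
    from typing import Callable

    @dataclass
    class Record:
        record_id: str
        created_at: datetime
        user_requested_delete: bool = False

    @dataclass
    class LegalHold:
        matter: str                           # e.g. "Lopez v. Apple"
        in_scope: Callable[[Record], bool]    # does this record relate to the suit?

    RETENTION = timedelta(hours=24)           # the contractual auto-delete window
    ACTIVE_HOLDS: list[LegalHold] = []        # to be populated as soon as litigation is known

    def sweep(records: list[Record]) -> list[Record]:
        """Drop expired or user-deleted records, except anything under a hold."""
        now = datetime.now(timezone.utc)
        kept = []
        for rec in records:
            held = any(h.in_scope(rec) for h in ACTIVE_HOLDS)
            expired = now - rec.created_at > RETENTION
            # A hold overrides both the retention policy and the user's delete
            # request. My lay reading of Lopez is that failing to flip this
            # behavior on immediately is exactly what Apple was faulted for.
            if held or not (expired or rec.user_requested_delete):
                kept.append(rec)
        return kept

The hard part in practice isn't this loop, it's deciding what in_scope should match, which is why parties end up back in front of the judge arguing about the extent of the obligation.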
So if Amazon sues Google, claiming that it is being disadvantaged in search rankings, a court should be able to force Google to log all search activity, even when users delete it?
Yes. That's how the US court system works.
Google can (and would) file to keep that data private, and only the relevant parts would be publicly available.
A core aspect of civil lawsuits is that everyone gets to see everyone else's data. It's that way to ensure everything is on the up and up.
A great model – in a world without the Internet and LLMs (or honestly just full text search).
Maybe you misunderstood. The data is required to be retained, but there is no requirement to make it accessible to the opposition. OpenAI already has this data and presumably mines it themselves.
Courts generally require far more data to be retained than shared, even if this ask is much more lopsided.
If Amazon sues Google, a legal obligation to preserve all evidence reasonably related to the subject of the suit attaches immediately when Google becomes aware of the suit, and, yes, if there is a dispute about the extent of that obligation and/or Google's actual or planned compliance with it, the court can issue an order relating to it.
At Google's scale, I wonder what the hosting costs of this would be. Very expensive after a certain point, I would guess.
>At Google's scale, I wonder what the hosting costs of this would be. Very expensive after a certain point, I would guess.
Which would be chump change[0] compared to the costs of an actual trial with multiple lawyers/law firms, expert witnesses and the infrastructure to support the legal team before, during and after trial.
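Back of the envelope (every input below is a guess, not a Google figure):

    # Rough storage-cost sketch; all inputs are assumptions, not Google's numbers.
    searches_per_day = 8.5e9          # commonly cited ballpark for Google search volume
    bytes_per_logged_query = 1_000    # query text plus metadata, assumed
    days_retained = 365               # assume a year-long litigation hold

    total_gb = searches_per_day * bytes_per_logged_query * days_retained / 1e9
    cost_per_gb_month = 0.02          # generic object-storage pricing, assumed

    print(f"{total_gb / 1e6:.1f} PB, ~${total_gb * cost_per_gb_month:,.0f}/month")
    # -> 3.1 PB, ~$62,050/month

Petabytes sound scary, but at object-storage prices that's a rounding error next to what a case like this costs in legal fees.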
It can be just anonymised search history in this case.
> It can be just anonymised search history in this case.
Depending on the exact issues in the case, a court might allow that (more likely, it would allow only turning over anonymized data in discovery, if the issues were such that there was no clear need for more), but generally the obligation to preserve evidence does not include the right to edit evidence or replace it with reduced-information substitutes.
We found out that was a bad idea back in 2006, when AOL thought "what could the harm be?" and turned over anonymised search queries to researchers.
How did you jump from a court order to preserve evidence to dumping that data raw into the public record?
Courts have been dealing with discovery including secrets that litigants never want to go public for longer than AOL has existed.
That sounds impossible to do well enough without being accused of tampering with evidence.
Just erasing the userid isn’t enough to actually anonymize the data, and if you scrubbed location data and entities out of the logs you might have violated the court order.
Though it might be in our best interests as a society, we should probably be honest about the risks of this tradeoff; anonymization isn’t some magic wand.
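For anyone who hasn't seen how this fails in practice: the rows below are invented, but they're modeled on the documented queries that let NYT reporters re-identify AOL user 4417749 in 2006, after AOL had already replaced user ids with opaque numbers:

    from collections import defaultdict

    # Invented log rows in the style of the 2006 AOL release: the user id is
    # gone, but all of one user's queries still share an opaque session key.
    logs = [
        ("4417749", "numb fingers"),
        ("4417749", "landscapers in lilburn ga"),
        ("4417749", "homes sold in shadow lake subdivision gwinnett county"),
    ]

    profiles = defaultdict(list)
    for session, query in logs:
        profiles[session].append(query)

    # Location + interests + names form a quasi-identifier: each profile can
    # be specific enough to name one person, which is what happened to AOL.
    for session, queries in profiles.items():
        print(session, "=>", queries)

And per the point above, scrubbing the locations and names out of the logs would fix this, but then you've edited the evidence.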
So then the courts need to find who is setting their chats to be deleted and order them to stop. Or find specific infringing chatters and order OpenAI to preserve these specified users’ logs. OpenAI is doing the responsible thing here.
OpenAI is the custodian of the user data, so they are responsible. If you wanted the court (i.e., the plaintiffs) to find specific infringing chatters, first they'd have to get the data from OpenAI to find who it is -- which is exactly what they're trying to do, and why OpenAI is being told to preserve the data so they can review it.
So the courts should start ordering all ISPs, browsers, and OSs to log all browsing and chat activity going forward, so they can find out which people are doing bad things on the internet.
No, they should not.
However, if the ISP, for instance, is sued, then it (immediately and without a separate court order) becomes illegal for them to knowingly destroy evidence in their custody relevant to the issue for which they are being sued, and if there is a dispute about their handling of particular such evidence, a court can and will order them specifically to preserve relevant evidence as necessary. And, with or without a court order, their destruction of relevant evidence once they know of the suit can be the basis of both punitive sanctions and adverse findings in the case to which the evidence would have been relevant.
> So the courts should start ordering all ISPs, browsers, and OSs to log all browsing and chat activity going forward, so they can find out which people are doing bad things on the internet.
Not "all", just the ones involved in a current suit. They already routinely do this anyway (Party A is involved in a suit and is ordered to retain any and all evidence for the duration of the trial, starting from Party A's first knowledge of the trial).
You are mischaracterising what happens; you are presenting it as "Any court, at any time, can order any party who is not involved in any suit in that court to forever hold user data".
That is not what is happening.
If those entities were custodians in charge of the data at hand in the court case, the court would order that.
This post appears to be full of people who aren’t actually angry at the results of this case but angry at how the US legal system has been working for decades, possibly centuries; I don’t know when this precedent was first set.
Is it not valid to be concerned about overly broad invasions of privacy regardless of how long such orders have been occurring?
What privacy specifically? The courts have always been able to compel people to recount things they know, which could include a conversation between you and your plumber if it was somehow related to a case.
The company records and uses this stuff internally; retention is about keeping information accurate and accessible.
Lawsuits allow, in a limited context, the sharing of non-public information held by the individuals/companies in the lawsuit. But once you submit something to OpenAI, it’s now their information, not just your information.
I think that some of the people here dislike (or are alarmed by) the way that the court can compel parties to retain data which would otherwise have vanished into the ether.
> I think that some of the people here dislike (or are alarmed by) the way that the court can compel parties to retain data which would otherwise have vanished into the ether.
Maybe so, but this has been the case for hundreds of years.
After all, how on earth do you propose getting a fair hearing if the other party is allowed to destroy the evidence you asked for in your papers?
Because this is what would happen:
You: Your Honour, please ask the other party to turn over all their invoices for the period in question
Other Party: We will turn over only those invoices we have
*Other party goes back to the office and deletes everything.*
The thing is, once a party in a suit asks for a certain piece of evidence, the other party can't turn around and say "Our policy is to delete everything, and our policy trumps the orders of this court".
I think your points are all valid, but… On the other hand, this sort of preservation does substantially reduce user privacy, disclosing personal information to unauthorized parties, with no guarantees of security, no audits, and few safeguards.
This is much more concerning (from a privacy perspective) than a company using cookies to track which pages on a website they’ve visited.
> On the other hand, this sort of preservation does substantially reduce user privacy,
Yes, that's by design and already hundreds of years old in practice.
You cannot refuse a court evidence to protect your or anyone else's privacy.
I see no reason to make an exception for rich and powerful companies.
I don't want a party to a suit having the ability to suppress evidence due to privacy concerns. There is no privacy once you get to a civil court other than what the court, at its discretion, allows, such as anonymisation.
I disagree, because the information has already been recorded, and users don’t have a say in who is “authorized” to view it, whether that’s someone at the company or some random 3rd party the company sells the data to.
It’s the collection itself that’s the problem, not how soon the data is deleted once it’s economically worthless.
> with no guarantees of security, no audits, and few safeguards.
The courts pay far more attention to that stuff than profit maximizing entities like OpenAI.
I agree that your assessment of the legal state of play is likely accurate. That said, it is one thing for data to be cached in the short term, and another entirely for it to be permanently stored and then sent out to parties with which the user has only a distant and likely adversarial relationship.
There are many situations in which the deletion/destruction of ‘worthless’ data is treated as a security protection. The one that comes to mind is how some countries destroy fingerprint data after it has been used for the creation of a biometric passport. Do you really think this is a futile act?
>”The courts pay far more attention to that stuff than profit maximizing entities like OpenAI.”
I would be interested to see evidence of this. The courts claim to value data security, but I have never seen an audit of discovery-related data storage, and I suspect there are substantial vulnerabilities in the legal system, including the law firms. Can a user hold the court or opposing law firm financially accountable if they fail to safeguard this data? I’ve never seen this happen.
> That said, it is one thing for data to be cached in the short term
Cached data isn’t necessarily subject to a retention obligation in the first place. Just because an ISP has parts of a message in some buffer doesn’t mean that counts as a recording of the data. If Google never stores queries beyond what’s needed to serve a response, then it likely wouldn’t qualify.
Also, it’s on the entity providing data for the discovery process to do redaction as appropriate. The only way data ends up at the other end is if it gets sent in the first place. There can be a lot of back and forth here, and as evidence that the courts do care: https://www.law.cornell.edu/rules/frcp/rule_5.2
That is helpful, thanks, but I think it is not practical to redact LLM request information beyond the GDPR personally-identifiable standards without just deleting everything. My (admittedly quick) read of those rules is that their “redacted” information would still be readily identifiable anyway (not directly, but using basic data analysis). Their redaction standards for CC# and SSN are downright pathetic, and allow for easy recovery with modern techniques.
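To put a number on "pathetic": Rule 5.2 leaves the last four digits of an account number visible. For a 16-digit card number the first six digits are typically a public issuer BIN, so only six digits are actually hidden, and the Luhn checksum eliminates nine tenths of those. A quick sketch (Visa test BIN, invented digits):

    def luhn_ok(digits: str) -> bool:
        """Standard Luhn checksum used by payment card numbers."""
        total = 0
        for i, ch in enumerate(reversed(digits)):
            d = int(ch)
            if i % 2 == 1:
                d = d * 2 - 9 if d > 4 else d * 2
            total += d
        return total % 10 == 0

    bin_prefix = "411111"   # issuer BINs are public; this is the Visa test BIN
    last4 = "1111"          # the part a Rule 5.2 redaction leaves visible

    candidates = sum(
        luhn_ok(bin_prefix + f"{middle:06d}" + last4)
        for middle in range(10**6)
    )
    print(candidates)       # 100,000 possibilities before any other leak

Cross-reference those ~100k candidates against anything else in the produced documents (an expiry date, a breached hash set) and recovery gets trivial.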
It’s not an “invasion of privacy” for a company that already had the data to be prohibited from destroying it when it is sued in a case where that data is evidence.
Yeah, sure. But understanding the legal system tells us the players and what systems exist that we might be mad at.
For me, one company being obligated to retain business records during civil litigation against another company, with review happening within the normal discovery process, is tolerable. Considering the alternative is lawlessness, I'm fine with it.
Companies that make business records out of invading privacy? They, IMO, deserve the fury of 1000 suns.
If you cared about your privacy, why are you handing all this stuff to Sam Altman? Did he represent that OpenAI would be privacy-preserving? Have they taken any technical steps to avoid this scenario?
Either you didn't read what the other comment wrote, or you're just arguing in bad faith, which is even weirder because they were only explaining how the system has always worked.
> So then the courts need to find who is setting their chats to be deleted and order them to stop.
No, actually, they don't. Ordering a party to stop destroying evidence relevant to a current case (which is its obligation even without a court order), irrespective of whether someone else asks it to destroy that evidence, is both within the well-established power of the court and routine.
> Or find specific infringing chatters and order OpenAI to preserve these specified users’ logs.
OpenAI is the alleged infringer in the case.
Under this theory, if a company had employees shredding incriminating documents at night, the court would have to name those employees before ordering them to stop.
That is ridiculous. The company itself receives that order, and is IMMEDIATELY legally required to comply - from the CEO to the newest-hired member of the cleaning staff.
The Times does not need user logs to prove such a thing if it were true. The Times can show that it is possible by showing how its own users can access the text. Why would it need other users’ data?
> The Times does not need user logs to prove such a thing if it were true.
No, it needs to show how often it happens, to prove how much impact it’s had.
Why would that matter? If people didn’t use it as much, does that mean it doesn’t matter because only a few people were affected?
> Why would that matter
Because it’s a copyright infringement case, the existence and scale of the infringement are relevant to both whether there is liability and, if so, how much; the issue isn’t merely that it is possible for infringement to occur.
You have to argue damages. It actually has to have cost the NYT some money, and for that you need to know the extent.
We don't even know if the Times uses AI to get information from other sources either. They could get a hint of news and then produce their own material.
> We don't even know if the Times uses AI to get information from other sources
which is irrelevant at this stage. It's a legal principle that both sides can fairly discover evidence. And since finding out how much OpenAI has infringed copyright is pretty critical to the case, they need to find out.
After all, if it's only once or twice, that's a couple of dollars; if it's millions of times, that's hundreds of millions.
OpenAI is also entitled to discovery. They can literally get every email and chat the Times has, and require that from this point on it preserve such logs.
Who cares? That's not a legal argument and it doesn't mean anything to this case.
Oh, I was unaware that the Times was inventing a novel technology with novel legal questions.
It’s very impressive they managed to pull off such innovation in their spare time while running a newspaper and a site.
For the most part (there are a few exceptions), in the US lawsuits are not based on "possible" harm but actual observed harm. To show that, you need actual observed user behavior.
> The Times can show that it is possible
The allegation is not merely that infringement is possible; the actual occurrence and scale are relevant to the case.