How? This is retention for legal risk, not for training purposes.
They can still have contracts with other companies that stipulate they don't train on any of that data.
Your employees' seemingly private ChatGPT logs being aired in public during discovery for a random court case you aren't even involved in is absolutely a business risk.
I get where it's historically coming from, but the combination of American courts' almost unlimited discovery rights (paid for by the losing party, no less, which greatly increases legal risk even for people and companies not looking to litigate) and those discoveries ending up on the public record seems like a growing problem.
There's a qualitative difference resulting from quantitatively much easier access (querying some database vs. having to physically look through court records) and processing capabilities (an army of lawyers reading millions of pages vs. anyone, via an LLM) that doesn't seem to be accounted for.
I assume the folks who are concerned about their privacy could petition the court to keep their data confidential.
I occasionally use ChatGPT and I strongly object to the court forcing the collection of my data, in a lawsuit I am not named in, due merely to the possibility of copyright infringement. If I’m interested in petitioning the court to keep my data private, as you say is possible, how would I go about that?
Of course I haven’t sent anything actually sensitive to ChatGPT, but the use of copyright law to enforce a stricter surveillance regime is giving very strong “Right to Read” vibes.
> each book had a copyright monitor that reported when and where it was read, and by whom, to Central Licensing. (They used this information to catch reading pirates, but also to sell personal interest profiles to retailers.)
> It didn’t matter whether you did anything harmful—the offense was making it hard for the administrators to check on you. They assumed this meant you were doing something else forbidden, and they did not need to know what it was.
People need to read up on the LIBOR scandal. There was a lot of "wait, why are my chat logs suddenly being read out as evidence of a criminal conspiracy?"
Retention means an expansion of your threat model. Specifically, in a way you have little to no control over.
It's one thing if you get pwned because a hacker broke into your servers. It is another thing entirely if you get pwned because a hacker broke into somebody else's servers.
At this point, do we believe OpenAI has a strong security infrastructure? Given the court order, it doesn't seem possible for them to have sufficient security for practical purposes. Your data might be encrypted at rest, but who has the keys? When you're buying secure instances, you don't want the provider to have your keys...
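To illustrate what "the provider doesn't have your keys" would look like (a hypothetical sketch, assuming the Python `cryptography` package; hosted chat services don't generally work this way): if data were encrypted client-side with a key only the customer holds, a retention order against the provider would yield only ciphertext.

```python
# Sketch of client-held keys, assuming the `cryptography` package:
# the provider stores only ciphertext and never sees the key.
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

key = AESGCM.generate_key(bit_length=256)  # never leaves the client
aead = AESGCM(key)
nonce = os.urandom(12)
blob = nonce + aead.encrypt(nonce, b"sensitive prompt", None)
# Only `blob` is uploaded; without `key`, retained data is opaque.
assert aead.decrypt(blob[:12], blob[12:], None) == b"sensitive prompt"
```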
Isn't it a risk even if they retain nothing? Likely less of a risk, but it's still a risk that you have no way to deep dive on, and you can still get "pwned" because someone broke into their servers.
The difference between maintaining an active compromise and obtaining all past data at some indeterminate point in the future is huge. There's a reason cryptographic protocols place so much significance on forward secrecy.
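To make that concrete (a minimal sketch, assuming the Python `cryptography` package): each session derives its key from fresh ephemeral X25519 key pairs, so once the ephemerals are discarded, nothing an attacker steals later can recover past session keys.

```python
# Forward secrecy in miniature: per-session ephemeral key exchange.
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric.x25519 import X25519PrivateKey
from cryptography.hazmat.primitives.kdf.hkdf import HKDF

def session_key() -> bytes:
    # Fresh ephemeral key pairs, used for this session only.
    alice = X25519PrivateKey.generate()
    bob = X25519PrivateKey.generate()
    shared = alice.exchange(bob.public_key())
    # Derive the symmetric key, then let the ephemerals be dropped:
    # nothing retained afterwards can re-derive `shared`.
    return HKDF(algorithm=hashes.SHA256(), length=32,
                salt=None, info=b"session").derive(shared)

print(session_key().hex())  # differs every run; old keys are gone
```

Retention is the opposite design choice: it guarantees that a future compromise reaches arbitrarily far into the past.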
There's always risk. It's all about reducing risk.
Look at it this way: if your phone was stolen, would you want it to self-destruct or keep everything? (Assume you can decide to self-destruct it.) Clearly the former is safer. Maybe the data has already been pulled off and you're already pwned, but by deleting, if they didn't get the data, they now never will.
You just don't want to give adversaries infinite time to pwn you
Will a business located in another jurisdiction be comfortable knowing that the records of all staff queries and prompts are being stored and are potentially discoverable by other parties? This is more than just a Google search: these prompts contain business strategy and IP (context uploads, for example).
Why would the reason matter for people that don't want their data retained at all?
...Data that is kept can be exfiltrated.
Cannot emphasize this enough. If your psychologist’s records can be held for ransom, surely your ChatGPT queries will end up on the internet someday.
Do search engine companies have this requirement as well? I remember that back in the old days, deanonymizing “anonymous” query logs was an interesting exercise. I can’t imagine there’s any secrecy left today.
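A toy sketch of why stripping the user ID column doesn't anonymize a query log (the rows below are hypothetical, in the spirit of the 2006 AOL release, where a user was re-identified from her searches alone):

```python
# "Anonymized" query log: user IDs replaced by random tokens, but
# the queries themselves act as quasi-identifiers (made-up rows).
from collections import defaultdict

logs = [
    ("t_91a3", "plumbers in springfield"),
    ("t_91a3", "homes for sale maple street springfield"),
    ("t_7bc2", "weather tomorrow"),
    ("t_91a3", "smith family reunion 2006"),
]

# Group queries by token: intersecting a town, a street, and a
# surname narrows "t_91a3" to a handful of real people, no user-ID
# column required.
profile = defaultdict(list)
for token, query in logs:
    profile[token].append(query)
print(profile["t_91a3"])
```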
I recently had a high school assignment document get posted on a bunch of sites that sell homework help. As far as I know, that document was only ever submitted directly to the assignment upload page. So somewhere along the line, I suspect on the plagiarism-checker service, there was a hack, and ten years later some random school assignment with my name on it is all over the place.