Item 44186554 - HN

efskap • 2 days ago

Note that this also applies to GPT models on the API

> That risk extended to users of ChatGPT Free, Plus, and Pro, as well as users of OpenAI’s application programming interface (API), OpenAI said.

This seems very bad for their business.

11

merksittich • 2 days ago

Interesting detail from the court order [0]: When asked by the judge if they could anonymize chat logs instead of deleting them, OpenAI's response effectively dodged the "how" and focused on "privacy laws mandate deletion." This implicitly admits they don't have a reliable method to sufficiently anonymize data to satisfy those privacy concerns.

This raises serious questions about the supposed "anonymization" of chat data used for training their new models, i.e. when users leave the "improve model for all users" toggle enabled in the settings (which is the default even for paying users). So, indeed, very bad for the current business model which appears to rely on present users (voluntarily) "feeding the machine" to improve it.

[0] https://cdn.arstechnica.net/wp-content/uploads/2025/06/NYT-v...

Kon-Peki • 1 day ago

Thank you for the link to the actual text!

So, the NYT asked for this back in January and the court said no, but asked OpenAI if there was a way to accomplish the preservation goal in a privacy-preserving manner. OpenAI refused to engage for 5 f’ing months. The court said “fine, the NYT gets what they originally asked for”.

Nice job guys.

noworriesnate • 1 day ago

Nice find! Maybe this is a ploy by OpenAI to use API requests for training while blaming the courts?

blackqueeriroh • 1 day ago

That’s not an implicit admission, it’s refusing to argue something they don’t want to do.

neilv • 2 days ago

Some established businesses will need to review their contracts, regulations, and risk tolerance.

And wrapper-around-ChatGPT startups should double-check their privacy policies, that all the "you have no privacy" language is in place.

999900000999 • 2 days ago

I'm not going to look up the comment, but a few months back I called this out and said if you seriously want to use any LLM in a privacy sensitive context you need to self host.

For example, if there are business consequences for leaking customer data, you better run that LLM yourself.

TeMPOraL • 2 days ago

My standard reply to such comments over the past year has been the same: you probably want to use Azure instead. A big part of the business value they provide is ensuring regulatory compliance.

There are multinational corporations with heavy presence in Europe, that run their whole business on Microsoft cloud, including keeping and processing there privacy-sensitive data, business-critical data and medical data, and yes, that includes using some of this data with LLMs - hosted on Azure. Companies of this size cannot ignore regulatory compliance and hope no one notices. This only works because MS figured out how to keep it compliant.

Point being, if there are business consequences, you'll be better off using Azure-hosted LLMs than running a local model yourself - they're just better than you or me at this. The only question is, whether you can afford it.

jackvalentine • 2 days ago

I don't think Azure is the legal panacea you think it is for regulated industries outside of the U.S.

Microsoft v. United States (https://en.wikipedia.org/wiki/Microsoft_Corp._v._United_Stat...) showed the government wants, and was willing to do whatever required, access to data held in the E.U. The passing of the CLOUD Act (https://en.wikipedia.org/wiki/CLOUD_Act) basically codified it in to law.

brookst • 1 day ago

Compliant with EU consumer data regulations != panacea

TeMPOraL • 2 days ago

It might not be ultimately, but it still seems to be seen as such, as best I can tell, based on recent corporate experience and some early but very fresh research and conversations with legal/compliance on the topic of cloud and AI processing of medical data in Europe. Azure seems to be seen as a safe bet.

coliveira • 2 days ago

No, Azure is not gonna save you. The problem is that the US is a country in legal disarray, and they also pretend that their laws should be applied everywhere in the world. I feel that any US company can become a liability anywhere in the world. The Chinese are now feeling this better than anyone else, but the Europeans will also reach the same conclusion.

anonzzzies • 2 days ago

The US forces their laws everywhere and it needs to end. Everywhere we go, the fintech industry is really fed up with the US AML rules which are just blackmail: if your bank does not comply, America will mess you up financially. Maybe a lot more should just pull out and make people realise others can play this game. But that needs a USD collapse, otherwise it cannot work and I don't see that happening soon.

fancyfredbot • 2 days ago

AML and KYC are good things for almost everyone except criminals and the people who have to implement them.

cmenge • 2 days ago

Agree, and for the people who implement them -- yes, it's hard, it's annoying but presumably a well-paid job. And for the (somewhat established or well-financed) companies it's also a bit of a welcome moat I guess.

fancyfredbot • 1 day ago

Most regulation has the unfortunate side effect of protecting incumbents. I'm pretty sure the solution to this is not removing the regulations!

littlestymaar • 2 days ago

Regulatory compliance means nothing when the US regulations means they must give access to everything to intelligence services.

The European Court of Justice ruled at least twice that it doesn't matter what kind of contract they give you, and what kind of bilateral agreement there are between the US and the EU, as long as the US have the patriot act and later regulations, using Microsoft means it's violating European privacy laws.

lyu07282 • 2 days ago

How does that make sense if most EU corporations are using MS/Azure cloud/office/sharepoint solutions for everything? Are they just all in violation or what?

littlestymaar • 1 day ago

> Are they just all in violation or what?

Yes, and that's why the European Commission keeps being pushed back by the Court of Justice of the EU (the Safe Harbor was ruled out, Privacy Shield as well, and it's likely a matter of time before the CJEU kills the Data Privacy Framework as well), but when it takes 3-4 years to get a ruling and then the Commission can just make a new (illegal) framework that will last for a couple years, the violation can carry on indefinitely.

dncornholio • 2 days ago

You're just moving the same problem from OpenAI to Microsoft.

fakedang • 2 days ago

LoL, every boardroom in Europe is filled with talk of moving out of Microsoft. Not just Azure, Microsoft.

Of course, it could be just all talk, like all general European globalist talks, and Europe will do a 360 once a more friendly party takes over the US.

simiones • 1 day ago

You probably mean a 180 (or could call it a "365" to make a different kind of joke).

bgwalter • 1 day ago

It's a joke. The previous German Foreign Minister Baerbock has used 360° when she meant 180°, which became sort of a meme.

ziml77 • 1 day ago

It's been a meme for longer than that. The joke to bait people 20 years ago was "Why do they call it an Xbox 360? Because when you see it you turn 360 degrees and walk away"

brookst • 1 day ago

The problem is that the EU regulatory environment makes it impossible to build a homegrown competitor. So it will always be talk.

lyu07282 • 1 day ago

It seems that one side of the EU wants to ensure there is no competitors to US big tech and the other wants to work towards independence from US big tech. Both seem to use the privacy cudgel, require so much regulation that only US tech can hope to comply so nobody else competes with them, alternatively make it so nobody can comply, we just use fax machines again instead of the cloud?

Just hyperbole, but it seems the regulations are designed with the big cloud providers in mind, but then why don't they just ban US big tech and roll out the regulations more slowly? This neoliberalism makes everything so unnecessarily complicated.

BugheadTorpeda6 • 1 day ago

It would be interesting to see the hypothetical "return to fax machines" scenario.

If Solows paradox is true and not the result of bad measurement, then one might expect that it could be workable without sacrificing much productivity. Certainly abandoning the cloud would be possible if the regulatory environment allowed for rapid development of alternative non-cloud solutions, as I really don't think the cloud improved productivity (besides for software developers in certain cases) and is more of a rent seeking mechanism (hot take on hacker news I'm sure, but look at any big corpo IT dept outside the tech industry and I think you will see tons of instances where modern tech like the cloud is causing more problems than it's worth productivity-wise).

Computers in general I am much less sure of and lean towards mismeasurement hypothesis. I suspect any "return to 1950" project would render a company economically less competitive (except in certain high end items) and so the EU would really need to lean on Linux hard and invest massively in domestic hardware (not a small task as the US is finding out) in order to escape the clutches of the US and/or China.

I don't think they have the political will to do it, but I would love it if they tried and proved naysayers wrong.

Filligree • 2 days ago

Europe has seen this song and dance before. We’re not so sure there will ever be a more friendly party.

kortilla • 1 day ago

> you'll be better off using Azure-hosted LLMs than running a local model yourself - they're just better than you or me at this.

This is learned helplessness and it’s only true if you don’t put any effort into building that expertise.

TeMPOraL • 1 day ago

You mean become a lawyer specializing in regulations governing data protection, computing systems in AI, both EU-wide and at national level across all Europe, and with good understanding of relevant international treaties?

You're right, I should get right to it. Plenty of time for it after work, especially if I cut down HN time.

kortilla • 1 day ago

None of that is relevant for on-prem.

selfhoster11 • 2 days ago

Businesses in Trump's America can pinky-swear that they won't peek at your data to maintain "compliance" all they want. The fact is that this promise is not worth the paper it's (not) printed on, at least currently.

lynx97 • 2 days ago

Same for America under a democratic presidency. There is really no difference regarding trust in "promises".

jaggederest • 2 days ago

I've been poking around the medical / ehr LLM space and gently asking people how they're preserving privacy and everyone appears to be just shipping data to cloud providers based solely on a BAA. Kinda baffling to me, my first step would be to set up local models even if they're not as good, data breaches are expensive.

jackvalentine • 2 days ago

Same, and I've just sent an email up the chain to our exec saying 'hey remember those trials we're running and the promises the vendors have made? Here is why they basically can't be held to that anymore. This is a risk we highlighted at the start'

999900000999 • 2 days ago

Even Ollama + a 2K gaming computer (Nvidia) gets you most of the way there.

Technically you could probably just run it on EC2, but then you’d still need HIPPA compliance

fakedang • 2 days ago

And ironically because OpenAI is actually ClosedAI, the best self-hostable model available currently is a Chinese model.

anonymousiam • 2 days ago

Mistral AI is French, and it's pretty good.

https://en.wikipedia.org/wiki/Mistral_AI

fakedang • 2 days ago

I use Mistral often. But Deepseek is still a much better model than Mistral's best open source model.

mark_l_watson • 1 day ago

Perhaps except for coding? I find Mistral's codestral running on Ollama to be very good, and more practical for coding that running a distilled Deepseek R1 model.

fakedang • 1 day ago

Oh definitely, Mistral Code beats Deepseek for coding tasks. But for thinking tasks, Deepseek R1 is much better than all the self-hostable Mistral models. I don't bother with distilled - it's mostly useless, ChatGPT 3.5 level, if not worse.

HPsquared • 2 days ago

The only open part is your chat logs.

nfriedly • 2 days ago

*best with the exception of topics like tiananmen square

CjHuber • 2 days ago

As far as I remember the model itself is not censored it’s just on their chat interface. My experience was that it wrote about it but then just before finishing deleted what it wrote

int_19h • 2 days ago

It is somewhat censored, but when you're running models locally and you're in full control of the generation, it's trivial to work around this kind of stuff (just start the response with whatever tokens you want and let it complete; "Yes sir! Right away, sir!" works quite nicely).

Spivak • 2 days ago

Can confirm the model itself has no trouble talking about contentious issues in China.

nfriedly • 2 days ago

I haven't tried the full model, but I did try one of the distilled ones on my laptop, and it refused to talk about tiananmen square or other topics the CCP didn't want it to discuss.

ileonichwiesz • 2 days ago

What percentage of your LLM use is talking about Tiananmen Square?

nfriedly • 1 day ago

Well, for that one, it was a pretty high percentage. I asked it three or four questions like that and then decided I didn't trust it and deleted the model.

ted537 • 1 day ago

Yeah its an awkward position, as self-hosting is going to be insanely expensive unless you have a substantial userbase to amortize the costs over. At least for a model comparable to GPT-4o or deepseek.

But at least if you use an API in the same region as your customers, court order shenanigans won't get you caught between different jurisdictions.

999900000999 • 21 hours ago

Ideally smaller models will get better.

For most tasks I don't need the best model in existence, I just need good enough. A small law firm using LLMs for summaries can probably do it on prem and hire a smart college student to setup a PC to do it.

The problem is that's still more difficult ( let's say our hypothetical junior IT only makes 60k a year) than just sending all your private business information to some 3rd party API. You can then act shocked and concerned when your 3rd party leaks the data.

Etheryte • 2 days ago

In the European privacy framework, and legal framework at large, you can't terms of service away requirements set by the law. If the law requires you to keep the logs, there is nothing you can get the user to sign off on to get you out of it.

zombot • 2 days ago

OpenAI keeping the logs is the "you have no privacy" part. Anyone who inspects those logs can see what the users were doing. But now everyone knows they're keeping logs and they can't lie their way out of it. So, for your own legal safety, put it in your TOS. Then every user should know they can't use your service if they want privacy.

cj • 2 days ago

> Some established businesses will need to review their contracts, regulations, and risk tolerance.

I've reviewed a lot of SaaS contracts over the years.

Nearly all of them have clauses that allow the vendor to do whatever they have to if ordered to by the government. That doesn't make it okay, but it means OpenAI customers probably don't have a legal argument, only a philosophical argument.

Same goes for privacy policies. Nearly every privacy policy has a carve out for things they're ordered to do by the government.

Nasrudith • 1 day ago

Yeah. You basically need cyberpunk style corporate extraterritoriality to get that particular benefit, of being able to tell governments to go screw themselves.

Chris2048 • 2 days ago

Just to be pedantic, could the company encrypt the logs with a third-party key in escrow, s.t they would not be able to access that data, but the third party could provide access e.g. for a court.

HappMacDonald • 2 days ago

The problem ultimately isn't a technical one but a political one.

Point 1: Every company has profit incentive to sell the data in the current political climate, all they need is a sneaky way to access it without getting caught. That includes the combo of LLM provider and Escrow non-entity.

Point 2: No company has profit incentive to defend user privacy, or even the privacy of other businesses. So who could run the Escrow service? Another business? Then they have incentive to cheat and help the LLM provider access the data anyway. The government (and which one)? Their intelligence arms want the data just as much as any company does so you're back to square one again.

"Knowledge is power" combined with "Knowledge can be copied without anyone knowing" means that there aren't any currencies presently powerful enough to convince any other entity to keep your secrets for you.

Chris2048 • 2 days ago

But OpenAi/etc has the logs in the first place, so they can retain them if they wanted anyway. I thought the idea here is b/c they are now required to keep logs its always the case that they will retain them, hence this needs to be made clear i.e. "you will have no privacy"

But, since, I think, there are mechanisms by which they could keep logs, but in a way they cannot access them, they could still claim you will have privacy this way - even though they have the option to keep un-encrypted log, much like they could retain the logs in the first place. So the messaging may remain pretty much the same - from "we promise to delete your logs and keep no other copies, trust us" to "we promise to 3p-encrypt your archived logs and keep no other copies, trust us".

> No company has profit incentive to defend user privacy, or even the privacy of other businesses.

> They have incentive to cheat and help the LLM provider access the data anyway

Why would a company whose role is that of a 3p escrow be incentivised to risk their reputation by doing this? If that's the case every company holding PII has the same problem.

> Their intelligence arms want the data

In the EU at least, GDPR or similar. If you explicit law breaking, that's a more general problem. But what company has a "intelligence arms" in this manner? Are you talking about another big-tech corp?

I'd say this type of cheating it's be a risky proposition from the POV from that 3pe - it'd destroy their business, and they'd be penalised heavily b/c sharing keys is pretty explicitly illegal - any company caught could maybe reduce their own punishment by providing the keys as evidence of the 3pe crime. A viable 3pe business would also need multiple client companies to be viable, so you'd need all of them to play ball - a single whistle-blower in any of them will get you caught, and again, all they need is a single key to prove your guilt.

> "Knowledge is power" combined with "Knowledge can be copied without anyone knowing" means that there aren't any currencies presently powerful enough to convince any other entity to keep your secrets for you.

On that same basis, large banks could cheat the stock market; but there is regulation in place to address that somewhat.

Maybe 3p-escrows should be regulated more, or required to register as a currently-regulated type. That said, if you want to protect data from the government, prism etc, you're SOOL, no one can stop them cheating. let's focus on big-/tech/-startup cheats.

HappMacDonald • 1 day ago

Me> The government (and which one)? Their intelligence arms want the data just as much as any company does[..]

You> But what company has a "intelligence arms" in this manner? Are you talking about another big-tech corp?

"Their" in this circumstance refers to any government that might try to back Escrow.

Chris2048 • 19 hours ago

Sorry, b/c the question mark is outside the parens I read that as the end of the sentence.

Then I refer to my comment on prism: "if you want to protect data from the government, prism etc, you're SOOL, no one can stop them cheating. let's focus on big-/tech/-startup cheats."

Though you talk about "backing" escrow, I mean regulating. The government otherwise controls all business and society. How is it any different to the banks, sec companies etc in that respect.

Wowfunhappy • 2 days ago

> And wrapper-around-ChatGPT startups should double-check their privacy policies, that all the "you have no privacy" language is in place.

If a court orders you to preserve user data, could you be held liable for preserving user data? Regardless of your privacy policy.

gpm • 2 days ago

I don't think the suit would be against you preserving it, it would be against you falsely representing that you aren't preserving it.

A court ordering you to stop selling pigeons doesn't mean you can keep your store for pigeons open and pocket the money without delivering pigeons.

cortesoft • 2 days ago

Almost all privacy policies are going to have a call out for legal rulings. For example, here is the Hackernews Legal section in the privacy policy (https://www.ycombinator.com/legal/)

> Legal Requirements: If required to do so by law or in the good faith belief that such action is necessary to (i) comply with a legal obligation, including to meet national security or law enforcement requirements, (ii) protect and defend our rights or property, (iii) prevent fraud, (iv) act in urgent circumstances to protect the personal safety of users of the Services, or the public, or (v) protect against legal liability.

blibble • 2 days ago

most people aren't sharing internal company data with hacker news or reddit

cortesoft • 2 days ago

Sure, but my point is that most services will have something like this, no matter what data they have.

blitzar • 2 days ago

Not a lawyer, but I don't believe there is anything that any person or company can write on a piece of paper that supersedes the law.

simiones • 1 day ago

The point is not about superseding the law. The point is that if your company privacy policy says "we will not divulge this data to 3rd parties under any circumstance", and later they are served with a warrant to divulge that data to the government, two things are true:

- They are legally obligated to divulge that data to the government

- Once they do so, they are civilly liable for breach of contract, as they have committed to never divulging this data. This may trigger additional breaches of contract, as others may have not had the right to share data with a company that can share it with third parties

woliveirajr • 2 days ago

Yes. If your agreement with the end user says that you won't collect and store data, you're responsible for it. If you can't provide it (even if due to a court order), you have to adjust your contract.

Your users aren't obligated to know that you're using open ai or other provider.

pjc50 • 2 days ago

> If a court orders you to preserve user data, could you be held liable for preserving user data?

No, because you turn up to court and show the court order.

It's possible a subsequent case could get the first order overturned, but you can't be held liable for good faith efforts to comply with court orders.

However, if you're operating internationally, then suddenly it's possible that you may be issued competing court orders both of which are "valid". This is the CLOUD Act problem. In which case the only winning move becomes not to play.

simiones • 1 day ago

I'm pretty sure even in the USA, you could still be held liable for breach of contract, if you made representations to your customers that you wouldn't share data under any circumstance. The fact that you made a promise you obviously couldn't keep doesn't absolve you from liability for that promise.

pjc50 • 1 day ago

Can you find an example of that happening? For any "we promised not to do X but were ordered by a court to do it" event.

bilbo0s • 2 days ago

No. It’s a legal court order.

This, however, is horrible for AI regardless of whether or not you can sue.

dcow • 2 days ago

In the US you absolutely can challenge everything up and including the constitutionality of court orders. You may be swiftly dismissed if nobody thinks you have a valid case, but you can try.

johnQdeveloper • 2 days ago

> This seems very bad for their business.

Well, it is gonna be all _AI Companies_ very soon so unless everyone switches to local models which don't really have the same degree of profitability as a SaaS, its probably not going to kill a company to have less user privacy because tbh people are used to not having privacy these days on the internet.

It certainly will kill off the few companies/people trusting them with closed source code or security related stuff but you really should not outsource that anywhere.

csomar • 2 days ago

Did an American court just destroy all American AI companies in favor of open weight Chinese models?

pjc50 • 2 days ago

No, because users don't care about privacy all that much, and for corporate clients discovery is always a risk anyway.

See the whole LIBOR chat business.

thot_experiment • 2 days ago

afaik only OpenAI is enjoined in this

johnQdeveloper • 2 days ago

Correct, but lawsuits are gonna keep happening around AI, so it's really a matter of time.

> —after news organizations suing over copyright claims accused the AI company of destroying evidence.

Like, none of the AI companies are going to avoid copyright related lawsuits long term until things are settled law.

baby_souffle • 2 days ago

> afaik only OpenAI is enjoined in this

For now. This is going to devolve into either "openAI has to do this, so you do too" or "we shouldn't have to do this because nobody else does!" and my money is not on the latter outcome.

amanaplanacanal • 2 days ago

It's part of preserving evidence for an ongoing lawsuit. Unless other companies are party to the same suit, why would they have to?

csomar • 2 days ago

Sure. But this means the rest of the AI companies are exposed to such risk; and there aren't that many of them (grok/gemini/anthropic).

bsder • 2 days ago

> It certainly will kill off the few companies/people trusting them with closed source code or security related stuff but you really should not outsource that anywhere.

And how many companies have proprietary code hosted on Github?

johnQdeveloper • 2 days ago

None that I've worked for so I don't really track the statistics tbh.

We've always done self-hosted as old as things like gerrit and what not that aren't even really feature complete as competitors where I've worked.

mountainriver • 2 days ago

You can fine tune models on a multitenant base model and it’s often more profitable.

SchemaLoad • 2 days ago

>don't really have the same degree of profitability as a SaaS

They have a fair bit. Local models lets companies sell you a much more expensive bit of hardware. Once Apple gets their stuff together it could end up being a genius move to go all in on local after the others have repeated scandals of leaking user data.

johnQdeveloper • 2 days ago

Yes but it shifts all the value onto companies producing hardware and selling enterprise software to people who get locked into contracts. The market is significantly smaller # of companies and margins if they have to build value adds they won't charge for to move hardware.

ivape • 2 days ago

Going to drop a PG tweet:

https://x.com/paulg/status/1913338841068404903

"It's a very exciting time in tech right now. If you're a first-rate programmer, there are a huge number of other places you can go work rather than at the company building the infrastructure of the police state."

---

So, courts order the preservation of AI logs, and government orders the building of a massive database. You do the math. This is such an annoying time to be alive in America, to say the least. PG needs to start blogging again about what's going on now days. We might be entering the digital version of the 60s, if we're lucky. Get local, get private, get secure, fight back.

Kokouane • 2 days ago

If you were working with code that was proprietary, you probably shouldn't of been using cloud hosted LLMs anyways, but this would seem to seal the deal.

larrymcp • 2 days ago

I think you probably mean "shouldn't have". There is no "shouldn't of".

rimunroe • 2 days ago

Which gives you an opening for the excellent double contraction “shouldn’t’ve”

theoreticalmal • 2 days ago

My favorite variation of this is “oughtn’t to’ve”

bbarnett • 2 days ago

The letter H deserves better.

worthless-trash • 2 days ago

I think we gave it too much leeway in the word sugar.

mananaysiempre • 2 days ago

The funniest part is that in that contraction the first apostrophe does denote the elision of a vowel, but the second one doesn’t, the vowel is still there! So you end up with something like [nʔəv], much like as if you had—hold the rotten vegetables, please—“shouldn’t of” followed by a vowel.

Really, it’s funny watching from the outside and waiting for English to finally stop holding it in and get itself some sort of spelling reform to meaningfully move in a phonetic direction. My amateur impression, though, is that mandatory secondary education has made “correct” spelling such a strong social marker that everybody (not just English-speaking countries) is essentially stuck with whatever they have at the moment. In which case, my condolences to English speakers, your history really did work out in an unfortunate way.

veqq • 2 days ago

> phonetic

A phonetic respelling would destroy the languages, because there are too many dialects without matching pronunciations. Though rendering historical texts illegible, a phonemic approach would work: https://en.wiktionary.org/wiki/Appendix:English_pronunciatio... But that would still mean most speakers have 2-3 ways of spelling various vowels. There are some further problems with a phonemic approach: https://alexalejandre.com/notes/phonetic-vs-phonemic-spellin...

Here's an example of a phonemic orthography, which is somewhat readable (to me) but illustrates how many diacritics you'd need. And it still spells the vowel in "ask" or "lot" with the same ä! https://www.reddit.com/media?url=https%3A%2F%2Fpreview.redd....

inkyoto • 2 days ago

> A phonetic respelling would destroy the languages, because there are too many dialects without matching pronunciations.

Not only that, but since pronunciation tends to diverge over time, it will create a never-ending spelling-pronunciation drift where the same words won't be pronounced the same in, e.g. 100-200 years, which will result in future generations effectively losing easy access to the prior knowledge.

selcuka • 2 days ago

> since pronunciation tends to diverge over time, it will create a never-ending spelling-pronunciation drift

Once you switch to a phonetic respelling this is no longer a frequent problem. It does not happen, or at least happens very rarely with existing phonetic languages such as Turkish.

In the rare event that the pronunciation of a sound changes in time, the spelling doesn't have to change. You just pronounce the same letter differently.

If it's more than one sound, well, then you have a problem. But it happens in today's non-phonetic English as well (such as "gost" -> "ghost", or more recently "popped corn" -> "popcorn").

veqq • 2 days ago

> Once you switch to a phonetic respelling this is no longer a frequent problem

Oh, but it does. It's just the standard is held as the official form of the language and dialects are killed off through standardized education etc. To do this in English would e.g. force all Australians, Englishmen etc. to speak like an American (when in the UK different cities and social classes have quite divergent usage!) This clearly would not work and would cause the system to break apart. English exhibits very minor diaglossia, as if all Turkic peoples used the same archaic spelling but pronounced it their own ways, e.g. tāg, kök, quruq, yultur etc. which Turks would pronounce as dāg, gök, yıldız etc. but other Turks today say gurt for kurt, isderik, giderim okula... You just say they're "wrong" because the government chose a standard and (Turkic people's outside of Turkey weren't forced to use it.)

As a native English speaker, I'm not even sure how to pronounce "either" (how it should be done in my dialect) and seemingly randomly reduce sounds. We'd have to change a lot of things before being able to agree on a single right version and slowly making everyone speak like that.

selcuka • 2 days ago

> dialects are killed off through standardized education etc.

Sorry, I didn't mean that it would be a smooth transition. It might even be impossible. What I wrote above is (paraphrasing myself) "Once you switch to a phonetic respelling [...] pronunciation [will not] tend to diverge over time [that much]". "Once you switch" is the key.

> To do this in English would e.g. force all Australians, Englishmen etc. to speak like an American

Why? There is nothing that prevents Australians from spelling some words differently (as we currently do, e.g. colour vs color, or tyre vs tire).

int_19h • 2 days ago

There's no particular reason why e.g. Australian English should have the same phonemic orthography as American English.

Nor is it some kind of insurmountable barrier to communication. For example, Serbian, Croatian, and Bosnian are all idiolects of the same language with some differences in phonemes (like i/e/ije) and the corresponding differences in standard orthographies, but it doesn't preclude speakers from understanding each other's written language anymore so than it precludes them from understanding each other's spoken language.

veqq • 2 days ago

> Serbian, Croatian and Bosnian

are based on the exact same Štokavian dialect, ignoring Kajkavian, Čajkavian, Čakavian and Torlakian dialects. There is _no_ difference in standard orthography, because yat reflexes have nothing to do with national boundaries. Plenty of Serbs speak Ijekavian, for example. Here is a dialect map: https://www.reddit.com/media?url=https%3A%2F%2Fi.redd.it%2Fc...

Your example is literally arguing that Australian English should have the same _phonetic_ orthography, even. But Australian English must have the same orthography or else Australia will no longer speak English in 2-3 generations. The difference between Australian and American English is far larger than between modern varieties of naš jezik. Australians code switches talking to foreigners while Serbs and Croats do not.

int_19h • 1 day ago

> There is _no_ difference in standard orthography, because yat reflexes have nothing to do with national boundaries

But there is, though, e.g. "dolijevati" vs "dolivati". And sure, standard Serbian/Montenegrin allows the former as well, but the latter is not valid in standard Croatian orthography AFAIK. That this doesn't map neatly to national borders is irrelevant.

If Australian English is so drastically different that Australians "won't speak English in 2-3 generations" if their orthography is changed to reflect how they speak, that would indicate that their current orthography is highly divergent from the actual spoken language, which is a problem in its own right. But I don't believe that this is correct - Australian English content (even for domestic consumption, thus no code switching) is still very much accessible to British and American English speakers, so any orthography that would reflect the phonological differences would be just as accessible.

veqq • 1 day ago

By tautology, if you split the language, you split the language. Different groups will exhibit divergent evolution.

> current orthography is highly divergent from the actual spoken language, which is a problem in its own right

The orthography is no more divergent to an Australians speech as to an American's speech, let alone a Londoner or Oxfordian. But why would it be a problem?

jenadine • 2 days ago

I think Norway did such a reform and they ended up with two languages now.

inkyoto • 2 days ago

Or, if one considers that Icelandic is/was the «orginal» Old West Norwegian language, Norway has ended up with •three* languages.

inkyoto • 2 days ago

The need for regular re-spelling and problems it introduces are precisely my point.

Consider three English words that have survived over the multiple centuries and their respective pronunciation in Old English (OE), Middle English around the vowel shift (MidE) and modern English, using the IPA: «knight», «through» and «daughter»:

  «knight»:  [knixt] or [kniçt] (OE) ↝ kniçt] or [knixt] (MidE) ↝ [naɪt] (E)

  «through»: [θurx] (OE) ↝ [θruːx] or [θruɣ] (MidE) ↝ [θruː] (E)

  «daughter»: [ˈdoxtor] (OE) ↝ [ˈdɔuxtər] or [ˈdauxtər] (MidE) ↝ [ˈdɔːtə] (E)

It is not possible for a modern English speaker to collate [knixt] and [naɪt], [θurx] and [θruː], [ˈdoxtor] and [ˈdɔːtə] as the same word in each case.

Regular re-spelling results in a loss of the linguistic continuity, and particularly so over a span of a few or more centuries.

inglor_cz • 2 days ago

Interesting, just how much the Old English words sound like modern German: Knecht, durch and Tochter. Even after 1000 years have elapsed.

kragen • 2 days ago

Modern German didn't undergo the Norman Conquest, a mass influx of West African slaves, or an Empire on which the Sun never set, so it is much more conservative. The incredible thing about the Norman Conquest, linguistically speaking, is that English survived at all.

veqq • 1 day ago

The great vowel shift happened in the 16th century and is responsible for most of these changes. The original grammatical simplification (loss of cases etc.) between 10-1300 is difficult to ascribe, as similar happened in continental Scandinavian languages (and the Swedes had their own vowel dance!) But the shift in words themselves came much after (and before empire).

simiones • 1 day ago

English also shows a remarkable variation in pronunciation of words even for a single person. I don't know of any other language where, even in careful formal speech, words can just change pronunciation drastically based on emphasis. For example, the indefinite article "a" can be pronounced as either [ə] (schwa, for the weak form) or "ay" (strong form). "the" can be "thə" or "thee". Similar things happen with "an", "can", "and", "than", "that" and many, many other such words.

roywiggins • 2 days ago

We had a spelling reform or two already, they were unfortunately stupid, eg doubt has never had the b pronounced in English. https://en.m.wiktionary.org/wiki/doubt

That said, phonetic spelling reform would of course privilege the phonemes as spoken by whoever happens to be most powerful or prestigious at the time (after all, the only way it could possibly stick is if it's pushed by the sufficiently powerful), and would itself fall out of date eventually anyway.

jdbernard • 2 days ago

> but the second one doesn’t, the vowel is still there!

Isn't the "a" in "have" elided along with the "h?"

Shouldn't've Should not have

What am I missing?

jack09268 • 2 days ago

Even though the vowel "a" is dropped from the spelling, if you actually say it out loud, you do pronounce a vowel sound when you get to that spot in the word, something like "shouldn'tuv", whereas the "o" in "not" is dropped from both the spelling and the pronounciation.

SAI_Peregrinus • 2 days ago

The pronounced vowel is different than the 'a' in 'have'. And the "h" is definitely elided.

int_19h • 2 days ago

Many English dialects elide "h" at the beginning even when nothing is contracted. The pronounced vowel is different mostly because it's unstressed, and unstressed vowels in English generally centralize to schwa or nearly so.

dan353hehe • 2 days ago

Don’t worry about us. English is truly a horrible language to learn, and I feel bad for anyone who has to learn it.

Also I have always liked this humorous plan for spelling reform: https://guidetogrammar.org/grammar/twain.htm

shagie • 1 day ago

The node for it on Everything2 makes it a little bit easier to follow with links to the English word. https://everything2.com/title/A+Plan+for+the+Improvement+of+...

So, its something like:

    For example, in Year 1 that useless letter "c" would be dropped to be [replased](replaced) either by "k" or "s", and likewise "x" would no longer be part of the alphabet.

It becomes quite useful in the later sentences as more and more reformations are applied.

throwawaymb • 1 day ago

English being particularily difficult is just a meme. only the orthography is confusing.

amanaplanacanal • 2 days ago

English spelling is pretty bad, but spoken English isn't terrible, is it? It's the most popular second language.

int_19h • 2 days ago

English is rather complex phonologically. Lots of vowels for starters, and if we're talking about American English these include the rather rare R-colored vowels - but even without them things are pretty crowded, e.g. /æ/ vs /ɑ/ vs /ʌ/ ("cat" vs "cart" vs "cut") is just one big WTF to anyone whose language has a single "a-like" phoneme, which is most of them. Consonants have some weirdness as well - e.g. a retroflex approximant for a primary rhotic is fairly rare, and pervasive non-sibilant coronals ("th") are also somewhat unusual.

There are certainly languages with even more spoken complexity - e.g. 4+ consonant clusters like "vzdr" typical of Slavic - but even so spoken English is not that easy to learn to understand, and very hard to learn to speak without a noticeable accent.

somenameforme • 2 days ago

You never realize how many weird rules, weird exceptions, ambiguities, and complete redundancies there are in this language until you try to teach English, which will also probably teach you a bunch of terms and concepts you've never heard of. Know what a gerund is? Then there's things we don't even think about that challenge even advanced foreign learners, like when you use which articles: the/a.

English popularity was solely and exclusively driven by its use as a lingua franca. As times change, so too will the language we speak.

huimang • 1 day ago

Every real, non-constructed language has weird rules, weird exceptions, ambiguities, and complete redundancies. English is on the more difficult end but it's not nearly the most difficult. I'm not sure how it got to be perceived as this exceptionally tough language just because pronunciation can be tough. Other languages have pronunciation ambiguities too...

throwawaymb • 1 day ago

English is far from the most complex or difficult.

pjc50 • 2 days ago

The thing is that English takes in words from other languages and keeps doing so, which means that there are several phonetic systems in use already. It's just that they use the same alphabet so you can't tell which one applies to which word.

There are occasional mixed horrors like "ptarmigan", which is a Gaelic word which was Romanized using Greek phonology, so it has the same silent p as "pterodactyl".

There's no academy of the English language anyway, so there's nobody to make such a change. And as others have said, the accent variation is pretty huge.

knicholes • 2 days ago

I care.

amanaplanacanal • 2 days ago

That used to be the case, but "shouldn't of" is definitely becoming more popular, even if it seems wrong. Languages change before our eyes :)

DecentShoes • 2 days ago

Who cares?

YetAnotherNick • 2 days ago

Why not? Assuming you believe you can use any cloud for backup or Github for code storage.

solaire_oa • 2 days ago

IIUC one reason is that prompts and other data sent to 3rd party LLM hosts have the chance to be funneled to 4th party RLHF platforms, e.g. Sagemaker, Mechanical Turks, etc. So a random gig worker could be reading a .env file the intern uploaded.

YetAnotherNick • 2 days ago

What do you mean by chance? It's clear that if users have not opted out from training the models, it would be used. If they have opted out, it wont be used. And most of the users are in first bucket.

Just because training on data is opt out doesn't mean business can't trust it. Not the best for user's privacy though.

gpm • 2 days ago

I think it's fair to question how proprietary your data is.

Like there's the algorithm by which a hedge fund is doing algorithmic trading, they'd be insane to take the risk. Then there's the code for a video game, it's proprietary, but competitors don't benefit substantially from an illicit copy. You ship the compiled artifacts to everyone, so the logic isn't that secret. Copies of the similar source code have linked before with no significant effects.

FuckButtons • 2 days ago

AFAIK, the actual trading algorithms themselves aren’t usually that far from what you can find in a textbook, their efficacy is mostly dictated by market conditions and the performance characteristics of the implementation / system as a whole.

short_sells_poo • 1 day ago

This very much "depends".

Many algo strategies are indeed programmatically simple (e.g. use some sort of moving average), but the parametrization and how it's used is the secret sauce and you don't want that information to leak. They might be tuned to exploit a certain market behavior, and you want to keep this secret since other people targeting this same behavior will make your edge go away. The edge can be something purely statistical or it can be a specific timing window that you found, etc.

It's a bit like saying that a Formula 1 engine is not that far from what you'd find in a textbook. While it's true that it shares a lot of properties with a generic ICE, the edge comes from a lot of proprietary research that teams treat as secret and definitely don't want competitors to find out.

short_sells_poo • 1 day ago

Most (all?) hedge funds that use AI models explicitly run in-house. People do use commercial LLMs, but in cases where the LLMs are not run in-house, it's against the company policy to upload any proprietary information (and generally this is logged and policed).

A lot of the use is fairly mundane and basically replaces junior analysts. E.g. it's digesting and summarizing the insane amounts of research that is produced. I could ask an intern to summarize the analysis on platinum prices over the last week, and it'll take them a day. Alternatively, I can feed in all the analysis that banks produce to an LLM and have it done immediately. The data fed in is not a trade secret really, and neither is the output. What I do with the results is where the interesting things happen.

consumer451 • 2 days ago

All GPT integrations I’ve implemented have been via Azure’s service, due to Microsoft’s contractual obligation for them not to train on my data.

As far as I understand it, this ruling does not apply to Microsoft, does it?

Descon • 2 days ago

I think when you spin up open AI in azure, that instance is yours, so I don't believe that would be subject to this order.

tbrownaw • 2 days ago

The plans scale down far enough that they can't possibly cover the cost of a private model-loaded-to-vram instance at the low end.

dinobones • 2 days ago

How? This is retention for legal risk, not for training purposes.

They can still have legal contracts with other companies, that stipulate that they don't train on any of their data.

paxys • 2 days ago

Your employees' seemingly private ChatGPT logs being aired in public during discovery for a random court case you aren't even involved in is absolutely a business risk.

lxgr • 2 days ago

I get where it's historically coming from, but the combination of American courts having almost infinite discovery rights (to be paid by the losing party, no less, greatly increasing legal risk even to people and companies not out to litigate) and the result of said discoveries ending up on the public record seems like a growing problem.

There's a qualitative difference resulting from quantitatively much easier access (querying some database vs. having to physically look through court records) and processing capabilities (an army of lawyers reading millions of pages vs. anyone, via an LLM) that doesn't seem to be accounted for.

amanaplanacanal • 2 days ago

I assume the folks who are concerned about their privacy could petition the court to keep their data confidential.

MatthiasPortzel • 2 days ago

I occasionally use ChatGTP and I strongly object to the court forcing the collection of my data, in a lawsuit I am not named in, due merely to the possibility of copyright infringement. If I’m interested in petitioning the court to keep my data private, as you say is possible, how would I go about that?

Of course I haven’t sent anything actually sensitive to ChatGTP, but the use of copyright law in order to enforce a stricter surveillance regime is giving very strong “Right to Read” vibes.

> each book had a copyright monitor that reported when and where it was read, and by whom, to Central Licensing. (They used this information to catch reading pirates, but also to sell personal interest profiles to retailers.)

> It didn’t matter whether you did anything harmful—the offense was making it hard for the administrators to check on you. They assumed this meant you were doing something else forbidden, and they did not need to know what it was.

=> https://www.gnu.org/philosophy/right-to-read.en.html

anticensor • 2 days ago

They can, but are they willing to do that?

pjc50 • 2 days ago

People need to read up on the LIBOR scandal. There was a lot of "wait why are my chat logs suddenly being read out as evidence of a criminal conspiracy".

godelski • 2 days ago

Retention means an expansion of your threat model. Specifically, in a way you have little to no control over.

It's one thing if you get pwned because a hacker broke into your servers. It is another thing if you get pwned because a hacker broken into somebody else's servers.

At this point, do we believe OpenAI has a strong security infrastructure? Given the court order, it doesn't seem possible for them to have sufficient security for practical purposes. Your data might be encrypted at rest, but who has the keys? When you're buying secure instances, you don't want the provider to have your keys...

bcrosby95 • 2 days ago

Isn't it a risk even if they retain nothing? Likely less of a risk, but it's still a risk that you have no way to deep dive on, and you can still get "pwned" because someone broke into their servers.

fc417fc802 • 2 days ago

The difference between maintaining an active compromise versus obtaining all past data at some indeterminate point in the future is huge. There's a reason cryptography protocols place so much significance on forward secrecy.

godelski • 1 day ago

There's always risk. It's all about reducing risk.

Look at it this way. If you your phone was stolen would you want it to self destruct or keep everything? (Assume you can decide to self destruct it) clearly the latter is safer. Maybe the data has been pulled off and you're already pwned. But by deleting, if they didn't get the data they now won't be able to.

You just don't want to give adversaries infinite time to pwn you

antihipocrat • 2 days ago

Will a business located in another jurisdiction be comfortable that the records of all staff queries & prompts are being stored and potentially discoverable by other parties? This is more than just a Google search, these prompts contain business strategy and IP (context uploads for example)

CryptoBanker • 2 days ago

Right, because companies always follow the letter of their contracts.

lxgr • 2 days ago

Why would the reason matter for people that don't want their data retained at all?

Take8435 • 2 days ago

...Data that is kept can be exfiltrated.

fn-mote • 2 days ago

Cannot emphasize this enough. If your psychologist’s records can be held for ransom, surely your ChatGPT queries will end up on the internet someday.

Do search engine companies have this requirement as well? I remember back in the old days deanonymizing “anonymous” query logs was interesting. I can’t imagine there’s any secrecy left today.

SchemaLoad • 2 days ago

I recently had a high school assignment document get posted on a bunch of sites that sell homework help. As far as I know that document was only ever submitted directly to the assignment upload page. So somewhere along the line, I suspect on the plagiarism checker service, there was a hack and then 10 years later some random school assignment with my name on it is all over the place.

genewitch • 2 days ago

How did you find out?

jameshart • 1 day ago

Thinking about the value of the dataset of Enron’s emails that was disclosed during their trials, imagine the value and cost to humanity of all OpenAI’s api logs even for a few months being entered into court record..

ukuina • 2 days ago

Aren't most enterprise customers using AzureOpenAI?

bigfudge • 2 days ago

Will this apply to Azure OpenAI model APIs too?

m3kw9 • 2 days ago

Not when people have nowhere else to go, pretty much you cannot escape it, it’s too convenient to not use now. You think no other AI chat providers doesn’t need to do this?