Bluesky's open API means anyone can scrape your data for AI training

paxys • 1 hour ago

I feel like there's a fundamental disconnect between the hacker ethos of "everything online is fair game, let's scrape a ton of data and see what cool stuff we can do" and the more...let's say European ideas about data and privacy rights. Neither one is inherently right or wrong, but the two cannot coexist on the internet.

PikachuEXE • 1 hour ago

Then what's the point for those moving from X to Bluesky?

AFAIK many are moving because their content might be used to train AI models.

ra_men • 1 hour ago

Very few people are switching to Bluesky so they don't have their data scraped for AI models. X has become a messy cesspool of spammy ads, toxic comments, and an owner intent on making it his personal echo chamber.

muglug • 1 hour ago

"They'll train an AI on my deeply-valuable thoughts" is the absolute least of my worries about having an account on Twitter.

It's more just about me not wanting to stay in the place I grew up after all my friends moved away, especially after a whole bunch of slightly terrible people moved in next door.

GermanJablo • 1 hour ago

I don't know where you got that idea. People are leaving because they don't like Elon.

whatasaas • 2 hours ago

Here comes the flurry of scary articles looking to find that angle. Can’t call this one a foreign actor, so now openness is “bad”.

NavinF • 2 hours ago

The article takes that angle because Bluesky users are currently harassing and threatening a researcher. Read the replies to his original thread: https://bsky.app/profile/danielvanstrien.bsky.social/post/3l...

From the article:

>Per a report by 404 Media, Daniel van Strien, a machine learning librarian at AI firm Hugging Face, pulled 1 million public posts from Bluesky via its Firehose API for machine learning research, pushing the dataset to a public repository. Van Strien later removed the data due to the controversy that ensued; however, it serves as a timely reminder that everything you post publicly to Bluesky is, well, public.

tacticus • 2 hours ago

So what was the license on the data?

HF certainly have a restrictive license on data they produce.

paxys • 1 hour ago

"Harassing and threatening" by asking him not to use their data illegally and without consent?

cle • 1 hour ago

> By creating an account you agree to the Terms of Service and Privacy Policy.

> The Bluesky App is a microblogging service for public conversation, so any information you add to your public profile and the information you post on the Bluesky App is public.

What did they expect? I don't know how they could make it more clear. It doesn't say "information you post will only be used by people you like". It is public.

Anybody can write like 10 lines of Python and start consuming the firehose.
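To make the "10 lines of Python" claim concrete, here is a rough sketch rather than a definitive client. It assumes Bluesky's Jetstream service (a JSON view of the firehose) at the URL below, and it assumes the event layout used in `extract_post_text`; the function names are made up for illustration. The raw firehose (`com.atproto.sync.subscribeRepos`) sends binary CBOR frames and needs extra decoding, so Jetstream is swapped in here to keep the example short.

```python
import json

# Assumption: a public Jetstream endpoint, filtered to post records only.
JETSTREAM_URL = (
    "wss://jetstream2.us-east.bsky.network/subscribe"
    "?wantedCollections=app.bsky.feed.post"
)


def extract_post_text(raw_event):
    """Return the text of a newly created post, or None for any other event.

    Assumes the Jetstream JSON layout: a top-level "kind" field and a
    "commit" object holding "operation", "collection" and "record".
    """
    event = json.loads(raw_event)
    if event.get("kind") != "commit":
        return None
    commit = event.get("commit") or {}
    if commit.get("operation") != "create":
        return None
    if commit.get("collection") != "app.bsky.feed.post":
        return None
    return (commit.get("record") or {}).get("text")


async def consume(limit=10):
    """Print the text of the first `limit` new posts, then disconnect."""
    import websockets  # third-party package: pip install websockets

    seen = 0
    async with websockets.connect(JETSTREAM_URL) as ws:
        async for raw_event in ws:
            text = extract_post_text(raw_event)
            if text is not None:
                print(text)
                seen += 1
                if seen >= limit:
                    break
```

With the `websockets` package installed, `import asyncio; asyncio.run(consume())` should print a handful of live posts. No login or API key is involved, which is the whole point being made here.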

paxys • 1 hour ago

Copyright laws don't cease to be a thing the moment you post something on the internet. The words are still yours. If this was X/Reddit/Facebook or the like instead of Bluesky the researcher would have immediately found himself on the wrong end of a DMCA takedown request and maybe even a lawsuit.

jazzyjackson • 1 hour ago

A lawsuit such as hiQ v. LinkedIn?

The one concluding that scraping publicly accessible data was not a violation of the CFAA, after which Twitter et al. went logged-in-users-only?

I don't know if copyright comes into this. While social media terms of service are very clear about licensing the comments of individual users, that's to protect them from what the law doesn't say implicitly. Is every comment posted to Twitter or Bluesky a "literary work"? An original, creative expression? I have my doubts but I guess there's room for a lawsuit yet.

https://bsky.app/profile/danielvanstrien.bsky.social/post/3l...

paxys • 1 hour ago

Why are Google, OpenAI etc. buying user generated data from Reddit and other similar sites for hundreds of millions of dollars a year? If Reddit comments aren't copyrighted and aren't original creative work then anyone should just be free to scrape them directly right?

jazzyjackson • 1 hour ago

I didn't see a lot of asking in that thread. They should probably organize and sue if they want to clarify what is legal. As it is, the bulk of the comments I read are denigrating and insulting:

> Your job is making the world a worse place

> Hey man, fuck you

> fuck you nerd

> Take your AI horseshit out of here you horrible piece of human trash.

> have you considered killing yourself

> This is garbage, you're a thief, and you disgust me.

> Throw yourself off a building [thumbs up emoji]

> Somebody cyberbully this guy

jazzyjackson • 1 hour ago

See also the comments where Daniel apologizes. Bluesky hasn't changed anything about the shape of Twitter and the kind of harassment that occurs. A mob of people have some vague enemy (AI bros) and pile on the first target of opportunity, directing all their "kill yourself" energy at one individual at a time.

nomilk • 1 hour ago

The hate doesn't prevent the research taking place, but ensures it won't occur publicly where it's more useful.

coding123 • 2 hours ago

There is literally no one that can be pleased anymore.

Dban1 • 2 hours ago

try drawing a straight line through 8+ billion scattered dots.

__loam • 2 hours ago

Bluesky themselves have made it abundantly clear they're with the artists on this one. The firehose isn't a license to use someone's art for commercial software development.

crowcroft • 2 hours ago

Is that worse than Meta, X or Google hoarding up for themselves?

Or is it worse than Reddit selling to anyone with money?

nicce • 1 hour ago

Maybe someone is upset that they can’t get a competitive advantage by throwing money at it.

krsdcbl • 1 hour ago

Maybe it's just me, or I might be missing a relevant implication, but I'm having a hard time understanding why so many people have become alarmist about the fact that things they publish on the web can and will be scraped?

0xDEAFBEAD • 29 minutes ago

Might just be one of these: https://en.wikipedia.org/wiki/Availability_cascade

I do find PG's idea of "aggressively conventional-minded" people to be a useful concept: https://paulgraham.com/conformism.html

ks2048 • 1 hour ago

It seems to be mainly a reaction against AI (as opposed to scraping in general, e.g. for a search engine).

I'm not saying it makes sense, but there is a large and growing idea of: I want my content out in the world, but I don't want companies to use it for training AIs, especially for profit.

OutOfHere • 1 hour ago

I don't see how Bluesky is sustainable. Who's paying for hosting the main instance of Bluesky? Who's paying for the firehose? How long will this last?

https://bsky.social/about/blog/7-05-2023-business-plan

0xDEAFBEAD • 23 minutes ago

>We believe that there must be better strategies to sustain social networks that don’t require selling user data for ads. Our first step in another direction is paid services, and we’re starting with custom domains...

>...

>We’re partnering with Namecheap, a popular domain registrar, to offer a service for easy domain purchasing and management.

jazzyjackson • 19 minutes ago

Blockchain Capital

https://www.blockchaincapital.com/blog/bluesky-13m-users-and...

OutOfHere • 5 minutes ago

So that's an investor, but it's not a revenue or profit model. The article notes:

> The future for Bluesky includes expanding their go-to-market efforts, building out their product roadmap with new features like subscription-based profile customizations, and further engaging with their growing developer community, all while maintaining their commitment to a free and accessible platform.

Is "subscription-based profile customizations" a sufficient revenue model? The investor will at some point want a return.

dlachausse • 1 hour ago

It’s a solid business model just like the Underpants Gnomes from South Park…

1. Build Twitter clone

2. ???

3. Profit

anonzzzies • 1 hour ago

People got angry on Bluesky and said it has to be forbidden, but you either want open or closed. If closed, your data is still used to train AI, but the owner of the network is making money from it. Better if anyone can get it; it's not like it is stoppable anymore anyway.

jazzyjackson • 1 hour ago

There's a lot that could be said for the behavior of Bluesky users and moderators, but aside from the practical matter that anyone can scrape your data, there seems to be some confusion on users' part about what it means that they have "ownership" of their data. You license it to Bluesky, but in the absence of other licensing agreements (or as Gen Z likes to say, "communication of consent preferences"), is there no way to prevent what you do in public from being absorbed by the facehuggers and monetized thereafter?

nomilk • 1 hour ago

Does Bluesky's decentralised nature make it hard/impossible to apply a bot blocker^ like Cloudflare?

^ More accurately, Cloudflare is a bot-slower, as it (and services like it) make it slower and more costly to scrape data, but not impossible.

jazzyjackson • 1 hour ago

Each relay could implement bot rejection as a means of saving bandwidth, but the whole point of the architecture is that the firehose can be mirrored.
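For a sense of what per-relay "bot slowing" could look like, here is a minimal sketch of a token-bucket rate limiter keyed by client. It is an illustration under stated assumptions, not how any Bluesky relay actually works: `TokenBucket`, `RelayThrottle`, and the `rate`/`capacity` values are all hypothetical. Note that it only slows a client talking to this one relay; anyone mirroring the stream elsewhere is unaffected.

```python
import time


class TokenBucket:
    """Allow `rate` requests per second on average, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)  # start with a full bucket
        self.updated = clock()

    def allow(self):
        """Refill based on elapsed time, then spend one token if available."""
        now = self.clock()
        elapsed = now - self.updated
        self.updated = now
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class RelayThrottle:
    """Hypothetical per-client throttle: one bucket per client identifier."""

    def __init__(self, rate=1.0, capacity=5, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.buckets = {}

    def allow(self, client_id):
        """Return True if this client may make a request right now."""
        bucket = self.buckets.get(client_id)
        if bucket is None:
            bucket = TokenBucket(self.rate, self.capacity, self.clock)
            self.buckets[client_id] = bucket
        return bucket.allow()
```

The clock is injectable so the behaviour can be checked without sleeping. In practice the client identifier would be something like an IP address or connection ID, which a determined scraper can rotate, so this raises the cost rather than closing the door.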

bastard_op • 2 hours ago

That way, if you try to sue them for reselling it, they can claim it's not just them grabbing up your data but a design feature. For the betterment of humanity and such.

2OEH8eoCRo0 • 1 hour ago

Good. Just like the open web.

forgotoldacc • 2 hours ago

Does bsky have an actual plan for how they'll make money long term?

The users are all in a honeymoon phase about how it's so different from twitter, but it seems like it's only a matter of time until they're directly selling data for AI training, offering paid corporate accounts, intrusive ads, etc. In the end, I imagine it'll end up just like twitter but with a different CEO. I use it simply because lots of others who I follow migrated, but I'm under no delusion that it'll be any different long term.

paxys • 1 hour ago

Long long term is anyone's guess, but I think if the team stays mission driven and doesn't get distracted they won't have any problems with money. The company is still owned by employees, and the entire team is just 20 people. They don't seem to have a bloated stack or unnecessary features. They got a sizable initial war chest of $14 million from Twitter. When that runs out I bet adding just a "donate" button somewhere on the site will be enough to cover expenses.

https://www.blockchaincapital.com/blog/bluesky-13m-users-and...

forgotoldacc • 21 minutes ago

14 million pays for a couple salaries and a few server bills. That burns away fast when users get higher and higher into the tens of millions.

Being mission driven and not getting distracted is nice, but that doesn't put money in their pockets. Firefox and other products that are/were beloved by a similar crowd have been asking for donations for years and it hasn't worked out. Wikipedia seems like the rare exception that pulls it off, and that's because they're aggressive with their campaigning for cash and they're basically an indispensable public service at this point.

jazzyjackson • 1 hour ago

Already raised another 15 million from Blockchain Capital. Not a donation, but a Series A. They could have walked the nonprofit donation route like Signal but they dance with VCs instead.

ks2048 • 1 hour ago

There's a statement here (July 2023) with some info,

https://bsky.social/about/blog/7-05-2023-business-plan

but selling domain services doesn't seem like it will go very far. I've seen some other rumors about paid accounts for extra features (posting longer or high-quality videos, etc).

Does anyone know if being a "public benefit corporation" is significant? Or will the same monetary pressures build up?

dartos • 1 hour ago

Well they recently published the AT protocol that they use.

It’s an open protocol, so they’re going for some kind of community angle maybe?

They could probably sell instances like Mastodon, but I don’t think that will scale how they probably want.

I think they will eventually find a way to either monetize the data or add an ad extension to AT proto.

sergiotapia • 2 hours ago

2 million Bluesky posts if you want to use them for something:

https://x.com/AlpinDale/status/1861819574259192082

perihelions • 1 hour ago

That's a very depressing and unpleasant-looking dataset. I don't know if that reflects more on BlueSky, or on the "uniform random sampling" aspect.

If we sampled HN data randomly, would *we* look that bad?

lupire • 2 hours ago

A book's open text interface means anyone can scrape your data for AI training.

smashah • 2 hours ago

An open and accessible API is good actually. "Bad" actors will always find a way around closed/restricted/no API access.

perihelions • 2 hours ago

What's the point of "enable[ing] users to communicate their consent preferences", when none of the entities who receive those preferences are under any obligation or restraint to respect them?

This is a "close elevator" button. It's a placebo button your users can press that can make them feel—without any basis in reality—more safe, more private, more $whatever. It's deceptive. An ethical company should get rid of those no-op preferences settings altogether.

dartos • 1 hour ago

I think the goal is to enable users to avoid content they don’t wish to see, not prevent content from going to specific places.

perihelions • 1 hour ago

No, in this case they are talking about consent about where their public posts are scraped to. I was quoting and replying to this part:

- "Bluesky said that it’s looking at ways to enable users to communicate their consent preferences externally, though it’s up to those parties whether they respect those preferences. The company posted: “Bluesky won’t be able to enforce this consent outside of our systems. It will be up to outside developers to respect these settings. We’re having ongoing conversations with engineers & lawyers and we hope to have more updates to share on this shortly!”"

dartos • 1 hour ago

Yes, this particular feature isn’t effective at stopping content from going somewhere unwanted, but I think the main design goal of bsky and the AT protocol isn’t that.

tacticus • 2 hours ago

HF wouldn't exist if it was an ethical company.

puppycodes • 2 hours ago

says some 60+ NPR guy who thinks AI wants his watermarked DSLR pic of a sunset so it can profit from his creative genius.

darkotic • 1 hour ago

oh no :scream:

dangus • 2 hours ago

As soon as someone figures out how to DMCA the AI industry the whole industry will become enshittified. It all relies on copyright infringement and “generative” AI doesn’t generate as much as it originally promised, it’s more like an extremely advanced search engine with the ability to combine and edit the source data.

An analogy is music producers who sample other tracks, who most definitely have to pay royalties.

If it was as easy to detect training data in an AI model as it is to detect a music video or movie in a YouTube video, every AI company would be toast.

You’d basically have a category of application whose cost is so high it’s hard to justify. You’ve gotta run the world’s most expensive type of computing to do the AI training and you have to license a massive amount of work from copyright owners for it to have any use.

To swing this back to being more related to the article at hand, I think that being open to the public can be okay if BlueSky or people who publish content on the platform are better able to exert their rights under copyright law. When I post something online I shouldn’t be giving up my copyright rights just because it’s hard to enforce.

If there was a law that was truly progressive about online privacy it would protect individuals’ intellectual property rights more on social networks. A social media company shouldn’t magically get to own my content just because they said so in their EULA.

NavinF • 2 hours ago

That's not how the scaling laws work. The number of samples required to reach a given quality level reduces exponentially over time. Most researchers use small datasets.

dangus • 1 hour ago

Interesting, because in The New York Times' lawsuit there is a very large block of text repeated verbatim. Page 30: https://nytco-assets.nytimes.com/2023/12/NYT_Complaint_Dec20...

How much of a copyrighted work do I have to copy and reproduce/redistribute to violate copyright law? Am I allowed to sell my handheld recording of the last two minutes of Gladiator 2 for $1.99 at the flea market?

amanaplanacanal • 2 hours ago

You don't have to give your data to social media companies, you know. That's just part of the trade-off in using their apps.