The article takes that angle because Bluesky users are currently harassing and threatening a researcher. Read the replies to his original thread: https://bsky.app/profile/danielvanstrien.bsky.social/post/3l...
From the article:
>Per a report by 404 Media, Daniel van Strien, a machine learning librarian at AI firm Hugging Face, pulled 1 million public posts from Bluesky via its Firehose API for machine learning research, pushing the dataset to a public repository. Van Strien later removed the data due to the controversy that ensued; however, it serves as a timely reminder that everything you post publicly to Bluesky is, well, public.
So what was the license on the data?
HF certainly have a restrictive license on data they produce.
"Harassing and threatening" by asking him not to use their data illegally and without consent?
> By creating an account you agree to the Terms of Service and Privacy Policy.
> The Bluesky App is a microblogging service for public conversation, so any information you add to your public profile and the information you post on the Bluesky App is public.
What did they expect? I don't know how they could make it more clear. It doesn't say "information you post will only be used by people you like". It is public.
Anybody can write like 10 lines of Python and start consuming the firehose.
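For a rough idea of what that looks like, here is a minimal sketch. It assumes Bluesky's public Jetstream relay (a JSON view of the firehose) at the URL below and the third-party `websockets` package; the endpoint, the event field names, and the helper names `extract_post_text` and `consume` are my assumptions for illustration, not anything taken from the article.

```python
import asyncio
import json

# Assumption: a public Jetstream instance that relays the firehose as JSON,
# filtered here to post records only.
JETSTREAM_URL = (
    "wss://jetstream2.us-east.bsky.network/subscribe"
    "?wantedCollections=app.bsky.feed.post"
)


def extract_post_text(raw_message):
    """Return the text of a newly created post, or None for any other event."""
    event = json.loads(raw_message)
    commit = event.get("commit") or {}
    # Assumed event shape: {"kind": "commit", "commit": {"operation": ..., ...}}
    if event.get("kind") != "commit" or commit.get("operation") != "create":
        return None
    if commit.get("collection") != "app.bsky.feed.post":
        return None
    return (commit.get("record") or {}).get("text")


async def consume(limit=10):
    """Print the text of the first `limit` new posts seen on the stream."""
    import websockets  # third-party: pip install websockets

    seen = 0
    async with websockets.connect(JETSTREAM_URL) as ws:
        async for raw in ws:
            text = extract_post_text(raw)
            if text is not None:
                print(text)
                seen += 1
                if seen >= limit:
                    break


# asyncio.run(consume())  # uncomment to stream live posts
```

No login or API key is involved in this sketch, which is rather the point: the stream is open to anyone who connects.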
Copyright laws don't cease to be a thing the moment you post something on the internet. The words are still yours. If this was X/Reddit/Facebook or the like instead of Bluesky the researcher would have immediately found himself on the wrong end of a DMCA takedown request and maybe even a lawsuit.
A lawsuit such as LinkedIn v. hiQ?
The one concluding that scraping publicly accessible data was not a violation of the CFAA, after which Twitter et al. went logged-in-users-only?
I don't know if copyright comes into this. While social media terms of service are very clear about licensing the comments of individual users, that's to protect them from what the law doesn't say implicitly. Is every comment posted to Twitter or Bluesky a "literary work"? An original, creative expression? I have my doubts, but I guess there's room for a lawsuit yet.
Why are Google, OpenAI etc. buying user generated data from Reddit and other similar sites for hundreds of millions of dollars a year? If Reddit comments aren't copyrighted and aren't original creative work then anyone should just be free to scrape them directly right?
I didn't see a lot of asking in that thread. They should probably organize and sue if they want to clarify what is legal. As it is, the bulk of the comments I read are denigrating and insulting:
> Your job is making the world a worse place
> Hey man, fuck you
> fuck you nerd
> Take your AI horseshit out of here you horrible piece of human trash.
> have you considered killing yourself
> This is garbage, you're a thief, and you disgust me.
> Throw yourself off a building [thumbs up emoji]
> Somebody cyberbully this guy
See also the comments where Daniel apologizes. Bluesky hasn't changed anything about the shape of Twitter and the kind of harassment that occurs. A mob of people have some vague enemy (AI bros) and pile on the first target of opportunity, directing all their "kill yourself" energy at one individual at a time.
https://bsky.app/profile/danielvanstrien.bsky.social/post/3l...
> hey man i don't know if anyone's told you yet but nobody likes you, you stink, your ideas are stupid, ai is loser shit, and you should go fuck yourself
> Get off this website, you fucking ghouls.
> Just go back to Twitter you Ai trash. Create your own work and stop trying to make money off of everyone else’s. No one wants you here.
> You're not sorry, you lying sack of shit. Get the fuck off this site. Better yet, get off the internet entirely, go live in the mountains and be a hermit for the next 50 years, you rancid fucking malignance.
> I’m using generative AI to make a rendering of you in a mass grave.
etc.
Keep in mind this is the post where he takes it down.