What's the license for the Bluesky data btw? Is it something free to mirror and train LLMs on?
So the ToS explicitly says Bluesky does NOT own your data.
However, data on AT Proto is fully public, and it’d be trivial for someone to extract it and train an AI on it.
For example, this app shows you entries hosted on the protocol: https://atproto-browser.vercel.app/at/nytimes.com
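To give a rough idea of how little it takes, here is a minimal sketch in Python that builds a request to Bluesky's public, unauthenticated AppView endpoint `app.bsky.feed.getAuthorFeed` and pulls the post text out of the response. The host `public.api.bsky.app`, the helper names, and the response shape assumed here (`feed` → `post` → `record` → `text`) are my own choices for illustration and are worth checking against the current API docs. It says nothing about whether doing this is allowed.

```python
import json
import urllib.parse
import urllib.request

# Assumption: Bluesky's public AppView host, which serves read-only
# queries without authentication.
PUBLIC_APPVIEW = "https://public.api.bsky.app"


def author_feed_url(actor, limit=10, host=PUBLIC_APPVIEW):
    """Build the XRPC URL for an account's public feed (helper name is hypothetical)."""
    query = urllib.parse.urlencode({"actor": actor, "limit": limit})
    return f"{host}/xrpc/app.bsky.feed.getAuthorFeed?{query}"


def extract_texts(payload):
    """Collect post text from a getAuthorFeed-style response dict.

    Assumes the shape {"feed": [{"post": {"record": {"text": ...}}}]};
    items without text are skipped.
    """
    texts = []
    for item in payload.get("feed", []):
        text = item.get("post", {}).get("record", {}).get("text")
        if text:
            texts.append(text)
    return texts


def fetch_author_texts(actor, limit=10):
    """Fetch one page of an account's public posts over plain HTTPS."""
    with urllib.request.urlopen(author_feed_url(actor, limit), timeout=10) as resp:
        return extract_texts(json.load(resp))
```

Calling `fetch_author_texts("nytimes.com", limit=5)` would make a single unauthenticated GET request. Paging through a whole account, or the whole network, is mostly a matter of following the cursor the API returns.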
Based on https://bsky.social/about/support/tos#user-content , I would answer yes. It's not expressly called out as either permitted or forbidden, but my reading of that section is that it's not forbidden per se, and probably permitted ("Modify or otherwise utilize User Content in any media. This includes reproducing, preparing derivative works, distributing, performing, and displaying your User Content."). I believe training an LLM falls under "utilize" and "preparing derivative works".