When Twitter/X’s updated Terms of Service, which had previously caused quite a stir for essentially allowing the platform to train its AI models on your posts with no way to opt out, came into effect on November 15, many people quickly packed their bags and flocked to alternative platforms in an effort to keep their content from being repurposed for machine learning.
One such platform you’ve likely heard of, and perhaps even registered on, is Bluesky, a Twitter-like microblogging social network that recently crossed the milestone of 20 million monthly active users, a figure that might not seem that impressive compared to X’s over 600 million MAUs but is nonetheless commendable given it’s triple the size it was just three months ago.
Unfortunately for the millions of new users – many of whom joined Bluesky specifically to avoid the AI-related issue – it appears that their new platform of choice is subject to the same kind of scraping campaigns that target other social networks, with recent findings revealing that hundreds of millions of Bluesky posts are being collected into public datasets, ready to be used for training chatbots.
The first reports about Bluesky post datasets being created by ML devotees began popping up in late November when Daniel van Strien, a machine learning librarian at Hugging Face, published a dataset containing 1 million public posts extracted from Bluesky’s firehose API, with each post including text content, metadata, details about media attachments, and information on reply relationships.
The dataset’s description stated that it was “intended for machine learning research and experimentation with social media data.” Because each post was accompanied by its author’s decentralized identifier, the collected posts were not anonymous, and many users understandably took issue with that. The pushback was significant enough that the dataset’s creator chose to remove the data and apologize for violating “principles of transparency and consent in data collection.”
I’ve removed the Bluesky data from the repo. While I wanted to support tool development for the platform, I recognize this approach violated principles of transparency and consent in data collection. I apologize for this mistake.
— Daniel van Strien (@danielvanstrien.bsky.social) November 27, 2024 at 7:19 AM
Sadly, the story didn’t end there, and following the removal of van Strien’s dataset, several other, even larger datasets surfaced online, including one containing nearly 300 million non-anonymous posts – approximately 42.5% of all posts shared on the platform, as per Jaz’s Bluesky index.
In her latest report, 404 Media’s Samantha Cole uncovered several datasets containing more posts than van Strien’s, including one by Alpin Dale with 2 million posts and another by Alim Maasoglu with 8 million. According to their descriptions, Alpin’s dataset is intended for “training and testing language models on social media content, analyzing social media posting patterns, and studying conversation structures and reply networks,” while Alim’s is meant for “social media content analysis, language processing research, trend analysis, and content recommendation systems.”
Both of them, however, pale in comparison to the dataset shared by Hugging Face user GAYSEX, which contains a staggering 298 million text posts scraped from Bluesky. In the project description, the author noted that they are advocating for internet users to leave social media platforms like Twitter, Facebook, TikTok, and Instagram in favor of old-school forums, encouraging people to “find your own circles by searching.” Unlike the previously mentioned datasets, this one lacks a comprehensive description outlining potential uses, structure, and configuration, revealing only that the list of posts is “too unfiltered, so there’s gonna be a lot of work that needs to be done” if one wanted to use it for AI training.
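For a sense of scale, the two figures quoted above (298 million posts, said to be approximately 42.5% of everything shared on the platform) imply a rough total for all Bluesky posts. The snippet below is only a back-of-the-envelope sketch: the variable names are ours, and the result is an estimate derived from the quoted numbers, not a figure reported by Bluesky or by Jaz’s index.

```python
# Figures quoted in the article; the total below is only an implied estimate.
posts_in_dataset = 298_000_000  # posts in the largest dataset
share_of_all_posts = 0.425      # "approximately 42.5%" of all posts

# If the dataset is 42.5% of all posts, the whole is dataset / share.
implied_total_posts = posts_in_dataset / share_of_all_posts

print(f"Implied total: about {implied_total_posts / 1e6:.0f} million posts")
```

In other words, if both numbers are accurate, the platform hosted roughly 700 million posts when the dataset was compiled.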
We highly encourage you to read Cole’s original story to learn more about how your Bluesky publications can be used for AI training.
While we’re on the topic of ML scraping on social networks, it would be remiss not to mention that the October report about Twitter using your posts for AI training whether you like it or not – referenced in the first paragraph of this article – has since become outdated. At some point between October 17 and today, X modified the “Your Rights and Grant of Rights in the Content” clause in its ToS, bringing back the part about respecting users’ choice to limit the distribution of their content via Twitter’s built-in functions. In other words, there is, in fact, a way to opt out now, and I should probably change the October article’s title.
Here’s how the clause looked prior to the October report:
During (the part about respecting the choice is nowhere to be found):
And here’s how it looks now, with the previously removed part put back: