AI bots strain Wikimedia as bandwidth surges 50%
arstechnica.com
Tales from the digital commons

Automated AI bots seeking training data threaten Wikipedia project stability, foundation says.

Benj Edwards | Apr 2, 2025 1:06 pm

Credit: Carol Yepes and Dana Neibert via Getty Images

On Tuesday, the Wikimedia Foundation announced that relentless AI scraping is putting strain on Wikipedia's servers. Automated bots seeking training data for large language models (LLMs) have been vacuuming up terabytes of data, growing the foundation's bandwidth used for downloading multimedia content by 50 percent since January 2024. It's a scenario familiar across the free and open source software (FOSS) community, as we've previously detailed.

The Foundation hosts not only Wikipedia but also platforms like Wikimedia Commons, which offers 144 million media files under open licenses. For decades, this content has powered everything from search results to school projects. But since early 2024, AI companies have dramatically increased automated scraping through direct crawling, APIs, and bulk downloads to feed their hungry AI models. This exponential growth in non-human traffic has imposed steep technical and financial costs, often without the attribution that helps sustain Wikimedia's volunteer ecosystem.

The impact isn't theoretical. The foundation says that when former US President Jimmy Carter died in December 2024, his Wikipedia page predictably drew millions of views. But the real stress came when users simultaneously streamed a 1.5-hour video of a 1980 debate from Wikimedia Commons. The surge doubled Wikimedia's normal network traffic, temporarily maxing out several of its Internet connections.
Wikimedia engineers quickly rerouted traffic to reduce congestion, but the event revealed a deeper problem: The baseline bandwidth had already been consumed largely by bots scraping media at scale.

This behavior is increasingly familiar across the FOSS world. Fedora's Pagure repository blocked all traffic from Brazil after similar scraping incidents covered by Ars Technica. GNOME's GitLab instance implemented proof-of-work challenges to filter excessive bot access. Read the Docs dramatically cut its bandwidth costs after blocking AI crawlers.

Wikimedia's internal data explains why this kind of traffic is so costly for open projects. Unlike humans, who tend to view popular and frequently cached articles, bots crawl obscure and less-accessed pages, forcing Wikimedia's core datacenters to serve them directly. Caching systems designed for predictable, human browsing behavior don't work when bots are reading the entire archive indiscriminately.

As a result, Wikimedia found that bots account for 65 percent of the most expensive requests to its core infrastructure despite making up just 35 percent of total pageviews. This asymmetry is a key technical insight: The cost of a bot request is far higher than that of a human one, and it adds up fast.

Crawlers that evade detection

Making the situation more difficult, many AI-focused crawlers do not play by established rules. Some ignore robots.txt directives. Others spoof browser user agents to disguise themselves as human visitors. Some even rotate through residential IP addresses to avoid blocking, tactics that have become common enough to force individual developers like Xe Iaso to adopt drastic protective measures for their code repositories.

This leaves Wikimedia's Site Reliability team in a perpetual state of defense. Every hour spent rate-limiting bots or mitigating traffic surges is time not spent supporting Wikimedia's contributors, users, or technical improvements. And it's not just content platforms under strain.
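For readers unfamiliar with the mechanism these crawlers are ignoring, here is a minimal sketch of how a well-behaved crawler consults robots.txt before fetching, using Python's standard urllib.robotparser. The rules, bot names, and paths below are made up for illustration; they are not Wikimedia's actual policy.

```python
# Sketch of robots.txt compliance checking with the standard library.
# The policy text and user-agent names here are hypothetical examples.
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Disallow: /w/

User-agent: BadTrainingBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A generic crawler may fetch articles, but not paths under /w/.
assert parser.can_fetch("GenericBot", "https://en.wikipedia.org/wiki/Cat")
assert not parser.can_fetch("GenericBot", "https://en.wikipedia.org/w/index.php")

# A bot that is disallowed entirely should fetch nothing at all --
# the misbehaving scrapers described above simply skip this check.
assert not parser.can_fetch("BadTrainingBot", "https://en.wikipedia.org/wiki/Cat")
```

The check is purely advisory: nothing in HTTP enforces it, which is why the crawlers described above can ignore it without any technical countermeasure stopping them.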
Developer infrastructure, like Wikimedia's code review tools and bug trackers, is also frequently hit by scrapers, further diverting attention and resources.

These problems mirror others in the AI scraping ecosystem. Curl developer Daniel Stenberg has detailed how fake, AI-generated bug reports are wasting human time. SourceHut's Drew DeVault has highlighted how bots hammer endpoints like git logs, far beyond what human developers would ever need.

Across the Internet, open platforms are experimenting with technical solutions: proof-of-work challenges, slow-response tarpits (like Nepenthes), collaborative crawler blocklists (like "ai.robots.txt"), and commercial tools like Cloudflare's AI Labyrinth. These approaches address the technical mismatch between infrastructure designed for human readers and the industrial-scale demands of AI training.

Open commons at risk

Wikimedia acknowledges the importance of providing "knowledge as a service," and its content is indeed freely licensed. But as the Foundation states plainly, "Our content is free, our infrastructure is not."

The organization is now focusing on systemic approaches to this issue under a new initiative, WE5: Responsible Use of Infrastructure. It raises critical questions about guiding developers toward less resource-intensive access methods and establishing sustainable boundaries while preserving openness.

The challenge lies in bridging two worlds: open knowledge repositories and commercial AI development. Many companies rely on open knowledge to train commercial models but don't contribute to the infrastructure making that knowledge accessible. This creates a technical imbalance that threatens the sustainability of community-run platforms.

Better coordination between AI developers and resource providers could potentially resolve these issues through dedicated APIs, shared infrastructure funding, or more efficient access patterns.
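As an aside, the proof-of-work challenges mentioned above work on a simple principle: the server issues a random challenge, and the client must burn CPU time finding a nonce whose hash meets a difficulty target before any content is served. The sketch below illustrates the idea only; the difficulty, encoding, and function names are assumptions for this example, not any project's actual protocol.

```python
# Toy proof-of-work challenge: the client searches for a nonce whose
# SHA-256 hash has DIFFICULTY_BITS leading zero bits; the server
# verifies with a single hash. All parameters here are illustrative.
import hashlib
import secrets

DIFFICULTY_BITS = 12  # each extra bit roughly doubles expected client work


def leading_zero_bits(digest: bytes) -> int:
    """Count the number of leading zero bits in a digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def solve(challenge: bytes) -> int:
    """Client side: brute-force a nonce (cheap once, costly at scale)."""
    nonce = 0
    while True:
        digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
        if leading_zero_bits(digest) >= DIFFICULTY_BITS:
            return nonce
        nonce += 1


def verify(challenge: bytes, nonce: int) -> bool:
    """Server side: verification costs just one hash."""
    digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
    return leading_zero_bits(digest) >= DIFFICULTY_BITS


challenge = secrets.token_bytes(16)
assert verify(challenge, solve(challenge))
```

The asymmetry is the point: a human browser pays the solving cost once per page load, while a scraper issuing millions of requests pays it millions of times, making indiscriminate crawling expensive without blocking anyone outright.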
Without such practical collaboration, the platforms that have enabled AI advancement may struggle to maintain reliable service. Wikimedia's warning is clear: Freedom of access does not mean freedom from consequences.

Benj Edwards is Ars Technica's Senior AI Reporter and founded the site's dedicated AI beat in 2022. He's also a tech historian with almost two decades of experience. In his free time, he writes and records music, collects vintage computers, and enjoys nature. He lives in Raleigh, NC.